# "Scikit-learn's Metadata Routing API"

### 1. What is Metadata?

- definition of metadata:
    - metadata is any data that we want to use alongside our tabular data without it necessarily being part of that data
    - alternative: metadata is any data, other than X and y, that an object or function in a data science pipeline/process handles; it is a parameter that influences how this step treats the data

- examples for metadata:
    - classical examples you might know from scikit-learn: sample_weight and groups
    - but other libraries offer support for other kinds of metadata
    - (show graphically some other examples of metadata: gender, race, sex, zipcode, ... and the area/library each is used in)
    - self-defined metadata to be used in custom metrics (and possibly custom estimators)

- use cases for metadata:
    - sample_weight can be used to balance out the data, and groups can be used to prevent data leakage
    - fairness related use case
    - business logic

- definition of routing:
    - routing just means that we pass metadata around (or through uninvolved steps) in a data science pipeline/process to where it is used/consumed

- before metadata routing API:
    - sample_weight was the only metadata the metrics accepted (besides y_true and y_pred), so we were limited to the places where sample_weight was defined
    - we could not use it consistently when the metric was part of a larger structure

- with metadata routing API:
    - we can pass sample_weight and groups through several levels of estimators and pipelines in scikit-learn
    - we can combine objects from other libraries with scikit-learn estimators while still passing their metadata
    - we can define our own custom metrics using self-defined metadata



### 2. Passing metadata without the routing API


In [1]:
import pandas as pd
data = pd.DataFrame({"sex":[1,0,1,1,0], "age":[17,32,82,27,54], "race":[1,0,0,1,0], "severity":[4,8,2,9,5], "medication":[1,1,1,0,0], "recovery_time":[10,22,90,32,5]})
data

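|   | sex | age | race | severity | medication | recovery_time |
|---|-----|-----|------|----------|------------|---------------|
| 0 | 1   | 17  | 1    | 4        | 1          | 10            |
| 1 | 0   | 32  | 0    | 8        | 1          | 22            |
| 2 | 1   | 82  | 0    | 2        | 1          | 90            |
| 3 | 1   | 27  | 1    | 9        | 0          | 32            |
| 4 | 0   | 54  | 0    | 5        | 0          | 5             |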


In [2]:
X = data.iloc[:, :-1].to_numpy()
y = data.iloc[:,-1].to_numpy()

- example: medical study on the effectiveness of a treatment
    - we would want to group the patients by hospital using the groups parameter scikit-learn offers
        - since a hospital is a collection of patients
        - we suspect that each hospital’s data may have systematic biases due to factors like medical devices, policies, socioeconomic status of the patients, ...
    - the same hospital should not be in both the train and the test set
    - groups is an array of length n_samples that assigns each sample to a group; it is used exclusively in splitters
        - it makes sure that patterns that exist within a group do not leak between the train and the test set
        - we want to train our models on the targets and not on other patterns within the data

    - without the routing API, we had to use a GroupKFold splitter and pass groups to it directly
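
To make the leakage argument concrete, here is a minimal sketch of what GroupKFold guarantees. The `hospital` array and the tiny `X_demo`/`y_demo` data are made up for illustration and are not part of the study data above.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical toy data: 8 patients from 4 hospitals (2 patients each).
hospital = np.array([0, 0, 1, 1, 2, 2, 3, 3])
X_demo = np.arange(16).reshape(8, 2)
y_demo = np.arange(8)

splits = list(GroupKFold(n_splits=2).split(X_demo, y_demo, groups=hospital))

for train_idx, test_idx in splits:
    train_hospitals = set(hospital[train_idx])
    test_hospitals = set(hospital[test_idx])
    # No hospital appears on both sides of a split.
    print(sorted(train_hospitals), sorted(test_hospitals))
```

Each printed pair lists the hospitals in the train and test part of one split; the two lists never share a hospital.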


In [28]:
import numpy as np

# larger synthetic dataset (random values, for illustration only)
rng = np.random.RandomState(42)

X = rng.rand(200, 5)
y = rng.randint(0, 2, size=X.shape[0])

groups = rng.randint(0, 10, size=X.shape[0])
sample_weight = rng.rand(X.shape[0])

In [35]:
from sklearn.model_selection import GroupKFold
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate

cv = GroupKFold(n_splits=2)

cross_validate(Ridge(), X, y, groups=groups, cv=cv, scoring="neg_mean_squared_error")

{'fit_time': array([0.00129414, 0.00083709]),
 'score_time': array([0.00055003, 0.000489  ]),
 'test_score': array([-0.27150562, -0.26527336])}

- but what if we also want to pass sample_weight?

- we would use sample_weight when we want to draw the attention of the machine learning algorithm to a specific group of samples that our data under-represents in some way

- in our example of a medical study on the effectiveness of a treatment, sample_weight might
    - be derived from the sex or race of the patient (to balance out the data)
    - or, if we suspect a correlation between a feature and whether a patient received the new treatment we are interested in (e.g. a bias in which patients were chosen for the treatment), sample_weight could be used to counterbalance that

- there are several methods to determine sample_weight (calculating proportions or more advanced statistics from the data, or using more general statistical principles or natural laws that we know will take effect)
- in general, we want to train our model on fair data when determining the effectiveness of a treatment for future patients (even if past data was not gathered in a randomized trial but is messy real-world data instead)
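
As a small sketch of the "calculating proportions" approach, the snippet below derives balancing weights from the `sex` column of the toy table above, using scikit-learn's `compute_sample_weight` helper. Treating `sex` as the attribute to balance is an assumption made for illustration; it is only one of many ways to construct sample_weight.

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# The sex column from the toy data above: three samples with 1, two with 0.
sex = np.array([1, 0, 1, 1, 0])

# "balanced" assigns n_samples / (n_values * count_of_value) to each sample,
# so the rarer value gets the larger weight.
balanced_weight = compute_sample_weight("balanced", sex)

# After weighting, both values of `sex` carry the same total weight.
total_weight_0 = balanced_weight[sex == 0].sum()
total_weight_1 = balanced_weight[sex == 1].sum()
print(balanced_weight, total_weight_0, total_weight_1)
```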




-----------------------------------------
older ideas of what sample_weight is:
- when do we want to pass sample_weight?
    - when we are more interested in minimizing the prediction error for a certain sub-group of the samples than the general error (by giving this sub-group a higher sample_weight than the rest of the data)
    - the loss for this sub-group is then often smaller than if we trained only on the samples we are interested in, because the model draws on the richness of all the data

In [31]:
# leave out the blunt attempt, because it's confusing?

from sklearn.linear_model import RidgeCV
from sklearn.metrics import get_scorer

scoring = get_scorer("neg_mean_squared_error")

# this is to show, not the real one used:
# sample_weight=[0.4, 0.3, 0.6, 0.8, 0.5]

cross_validate(RidgeCV(scoring=scoring), X, y, groups=groups, cv=cv, sample_weight=sample_weight, scoring=scoring)

TypeError: got an unexpected keyword argument 'sample_weight'

- without the routing API, we couldn't use sample_weight here, nor could we put it in a Pipeline or use it in any other nested structure

### 3. Using the metadata routing API

In [36]:
from sklearn.linear_model import RidgeCV
from sklearn.metrics import get_scorer

import sklearn
sklearn.set_config(enable_metadata_routing=True)

scoring = get_scorer("neg_mean_squared_error").set_score_request(sample_weight=True)

ridge = (
    RidgeCV(cv=GroupKFold(n_splits=2), scoring=scoring)
    .set_fit_request(sample_weight=True)
    .fit(X, y, sample_weight=sample_weight, groups=groups)
)

cross_validate(
    ridge,
    X,
    y,
    params={"sample_weight": sample_weight, "groups": groups},
    cv=GroupKFold(n_splits=2),
    scoring=scoring,
)

sklearn.set_config(enable_metadata_routing=False)

ValueError: 
All the 2 fits failed.
It is very likely that your model is misconfigured.
You can try to debug the error by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
2 fits failed with the following error:
Traceback (most recent call last):
  File "/home/stefanie/.pyenv/versions/metadata_talk/lib/python3.12/site-packages/sklearn/model_selection/_validation.py", line 888, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/stefanie/.pyenv/versions/metadata_talk/lib/python3.12/site-packages/sklearn/base.py", line 1473, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/stefanie/.pyenv/versions/metadata_talk/lib/python3.12/site-packages/sklearn/linear_model/_ridge.py", line 2672, in fit
    super().fit(X, y, sample_weight=sample_weight, **params)
  File "/home/stefanie/.pyenv/versions/metadata_talk/lib/python3.12/site-packages/sklearn/linear_model/_ridge.py", line 2436, in fit
    grid_search.fit(X, y, **params)
  File "/home/stefanie/.pyenv/versions/metadata_talk/lib/python3.12/site-packages/sklearn/base.py", line 1473, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/stefanie/.pyenv/versions/metadata_talk/lib/python3.12/site-packages/sklearn/model_selection/_search.py", line 1018, in fit
    self._run_search(evaluate_candidates)
  File "/home/stefanie/.pyenv/versions/metadata_talk/lib/python3.12/site-packages/sklearn/model_selection/_search.py", line 1572, in _run_search
    evaluate_candidates(ParameterGrid(self.param_grid))
  File "/home/stefanie/.pyenv/versions/metadata_talk/lib/python3.12/site-packages/sklearn/model_selection/_search.py", line 976, in evaluate_candidates
    for (cand_idx, parameters), (split_idx, (train, test)) in product(
                                                              ^^^^^^^^
  File "/home/stefanie/.pyenv/versions/metadata_talk/lib/python3.12/site-packages/sklearn/model_selection/_split.py", line 416, in split
    for train, test in super().split(X, y, groups):
  File "/home/stefanie/.pyenv/versions/metadata_talk/lib/python3.12/site-packages/sklearn/model_selection/_split.py", line 147, in split
    for test_index in self._iter_test_masks(X, y, groups):
  File "/home/stefanie/.pyenv/versions/metadata_talk/lib/python3.12/site-packages/sklearn/model_selection/_split.py", line 159, in _iter_test_masks
    for test_index in self._iter_test_indices(X, y, groups):
  File "/home/stefanie/.pyenv/versions/metadata_talk/lib/python3.12/site-packages/sklearn/model_selection/_split.py", line 602, in _iter_test_indices
    raise ValueError("The 'groups' parameter should not be None.")
ValueError: The 'groups' parameter should not be None.


- explain:
    - the scorer takes sample_weight (among estimators, the scoring= param is typically found in those ending in CV); the scorer then passes the metadata into the metric used in cross-validation to evaluate performance on the internal validation set
    - the splitter creates the CV splits and is mainly interested in groups

- summing up: with metadata routing API:
    - we can pass sample_weight and groups through several levels of estimators and pipelines in scikit-learn
    - we can combine objects from other libraries with scikit-learn estimators while still passing their metadata
    - we can define our own custom metrics using self-defined metadata
    - and use it in a special setting with [TunedThresholdClassifierCV](https://scikit-learn.org/dev/auto_examples/model_selection/plot_cost_sensitive_learning.html#cost-sensitive-learning-when-gains-and-costs-are-not-constant) (as we will see in the next part)

In [34]:
import sklearn
from sklearn.metrics import make_scorer, accuracy_score
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import cross_validate, GroupKFold


sklearn.set_config(enable_metadata_routing=True)

weighted_acc = make_scorer(accuracy_score).set_score_request(sample_weight=True)
lr = LogisticRegressionCV(
    cv=GroupKFold(n_splits=2),
    scoring=weighted_acc
).set_fit_request(sample_weight=True)
cv_results = cross_validate(
    lr,
    X,
    y,
    params={"sample_weight": sample_weight, "groups": groups},
    cv=GroupKFold(n_splits=2),
    scoring=weighted_acc,
)

sklearn.set_config(enable_metadata_routing=False)

### 4. Further information

- User Guide on Metadata Routing: <br>
[https://scikit-learn.org/stable/metadata_routing.html#metadata-routing](https://scikit-learn.org/stable/metadata_routing.html#metadata-routing)

- Developer Guide on Metadata Routing: <br>
[https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_metadata_routing.html#metadata-routing](https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_metadata_routing.html#metadata-routing)

- Adrin Jalali's talk on the internal logic of metadata routing at EuroPython Conference 2023: <br>
[https://www.youtube.com/watch?v=1rf6HI-pYq8](https://www.youtube.com/watch?v=1rf6HI-pYq8)

- Blog post by Florian Wilhelm on Inverse Probability of Treatment Weighting: <br>
[https://florianwilhelm.info/2017/04/causal_inference_propensity_score](https://florianwilhelm.info/2017/04/causal_inference_propensity_score)

- link to Vincent's VW video