# "Scikit-learn's Metadata Routing API"

### 1. What is Metadata?

- definition of metadata:
    - metadata is any data that we want to use alongside our tabular data without it necessarily being part of that data
    - alternative: metadata is any data, other than X and y, that an object or function in a data science pipeline/process handles; it is a param that influences how this step treats the data

- examples for metadata:
    - classical examples you might know from scikit-learn: sample_weight and groups
    - but other libraries offer support for other kinds of metadata
    - (graphically show some other examples of metadata: gender, race, sex, zipcode, .... and area/library it is used in)
    - self-defined metadata to be used in custom metrics (and possibly custom estimators)

- use cases for metadata:
    - sample_weight can be used to balance out the data, and groups to prevent data leakage between train and test sets
    - fairness related use case
    - business logic

- definition of routing:
    - routing means passing metadata through a data science pipeline/process, including through steps that do not use it, to the place where it is used/consumed

- before metadata routing API:
    - we were limited to the places where sample_weight was explicitly supported, because it was the only metadata param defined in the metrics (other than y_true and y_pred)
    - we could not use it consistently when the metric was part of a larger structure (e.g. a meta-estimator or cross-validation)

- with metadata routing API:
    - we can pass sample_weight and groups through several levels of estimators and pipelines in scikit-learn
    - we can combine objects from other libraries with scikit-learn estimators while still passing their metadata
    - we can define our own custom metrics using self-defined metadata (as we will see in the next part)



### 2. Passing metadata without the routing API
- very short code example to have people see the problem

- https://lms.fun-mooc.fr/courses/course-v1:inria+41026+session04/courseware/cb3cfcaf0cae4cf7801c4e8d5dab9087/da37206baa8d4426b93124368fabaf1e/

- here groups identifies the writer of each digit, since the same writer should not appear in both the train and the test set
    - before, we had to use GroupKFold
    - but what if you also have sample_weight? --> no way to pass it along as well
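
To make the problem concrete, here is a minimal sketch of the manual workaround: without routing, using both groups and sample_weight means writing the cross-validation loop by hand and slicing every piece of metadata yourself. The data is synthetic, and the `groups` and `sample_weight` arrays are made up for illustration (they do not come from the digits dataset mentioned above).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GroupKFold

# Synthetic stand-in data; groups and weights are hypothetical.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
rng = np.random.default_rng(0)
groups = rng.integers(0, 10, size=len(y))           # hypothetical writer ids
sample_weight = rng.uniform(0.5, 1.5, size=len(y))  # hypothetical weights

scores = []
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=groups):
    # Each piece of metadata has to be sliced and handed over manually.
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X[train_idx], y[train_idx], sample_weight=sample_weight[train_idx])
    y_pred = clf.predict(X[test_idx])
    scores.append(
        accuracy_score(y[test_idx], y_pred, sample_weight=sample_weight[test_idx])
    )

print(np.mean(scores))
```

This works for one flat loop, but it does not scale to nested structures such as a grid search inside a cross-validation.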


### 3. Using the metadata routing API

- example: medical use case? 
    - groups correspond to hospital (we suspect that each hospital’s data may have systematic biases due to factors like medical devices, policies, socioeconomic status of the patients, ...)
    - sample_weight encodes zip-code or race of the patient
    - we are interested in a fair model to determine a recommendation for a treatment among several options (classification problem)

    - sample_weight is either given by experts,
    - or derived from the data to even out unfairness,
    - or derived from some other process at the time the sample was created

- boilerplate in a very short example
    - weighted scoring and fitting example from documentation: https://scikit-learn.org/stable/metadata_routing.html#metadata-routing

- the scorer takes sample_weight (the scoring= param is mainly found in estimators ending in CV and in functions such as cross_validate); the scorer then passes the metadata into the metric that cross-validation uses to evaluate performance on the internal validation set
- the splitter creates the CV splits and is mainly interested in groups

- set stage for TunedThresholdClassifierCV:
    - TunedThresholdClassifierCV tunes the decision threshold to optimize a metric, and with metadata routing we can define our own scoring metrics that also consume metadata (https://scikit-learn.org/dev/auto_examples/model_selection/plot_cost_sensitive_learning.html#cost-sensitive-learning-when-gains-and-costs-are-not-constant)

```python
import sklearn

# Enable the opt-in metadata routing API globally
sklearn.set_config(enable_metadata_routing=True)
```


### 4. Further information

- User Guide on Metadata Routing: <br>
[https://scikit-learn.org/stable/metadata_routing.html#metadata-routing](https://scikit-learn.org/stable/metadata_routing.html#metadata-routing)

- Developer Guide on Metadata Routing: <br>
[https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_metadata_routing.html#metadata-routing](https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_metadata_routing.html#metadata-routing)

- Adrin Jalali's talk on the internal logic of metadata routing at EuroPython Conference 2023: <br>
[https://www.youtube.com/watch?v=1rf6HI-pYq8](https://www.youtube.com/watch?v=1rf6HI-pYq8)