# What if I know the propensity score?

In some experimental settings we may know the treatment assignment probabilities beforehand,
e.g. if we have data from a {term}`RCT<Randomized Control Trial (RCT)>` with known treatment
probabilities.

In that case we may not want to learn a propensity model but rather just use the known probabilities.

Loading the data
----------------

Just like in our {ref}`example on estimating CATEs with a MetaLearner
<example-basic>`, we will first load some experiment data:

In [None]:
import pandas as pd
from pathlib import Path
from git_root import git_root

df = pd.read_csv(git_root("data/learning_mindset.zip"))
outcome_column = "achievement_score"
treatment_column = "intervention"
feature_columns = [
    column for column in df.columns if column not in [outcome_column, treatment_column]
]
categorical_feature_columns = [
    "ethnicity",
    "gender",
    "frst_in_family",
    "school_urbanicity",
    "schoolid",
]
# Note that explicitly setting the dtype of these features to category
# allows both lightgbm and shap plots to
# 1. Operate on features which are not of type int, bool or float
# 2. Interpret categoricals with int values as categoricals
#    rather than as ordinals/numericals.
for categorical_feature_column in categorical_feature_columns:
    df[categorical_feature_column] = df[categorical_feature_column].astype("category")

Creating our own estimator
--------------------------

In this tutorial we will assume that we know that all observations were assigned to the
treatment with a fixed probability of 0.3, which is close to the fraction of the observations
assigned to the treatment group:

In [None]:
df[treatment_column].mean()

```{note}
The assumption of a fixed propensity score for all observations does not actually hold for
this dataset; we only use it for illustration purposes.
```

Now we can define our custom ``sklearn``-like classifier. We recommend inheriting from
the ``sklearn`` base classes and following the rules explained in the
[sklearn documentation](https://scikit-learn.org/stable/developers/develop.html). This avoids
having to define helper functions and ensures that the estimator works correctly with the
``metalearners`` library.

In [None]:
from sklearn.base import BaseEstimator, ClassifierMixin
from typing import Any
from typing_extensions import Self
import numpy as np
import pandas as pd


class FixedPropensityModel(ClassifierMixin, BaseEstimator):
    def __init__(self, propensity_score: float) -> None:
        self.propensity_score = propensity_score

    def fit(self, X: pd.DataFrame, y: pd.Series) -> Self:
        self.classes_ = np.unique(y.to_numpy())  # sklearn requires this
        return self

    def predict(self, X: pd.DataFrame) -> np.ndarray[Any, Any]:
        return np.argmax(self.predict_proba(X), axis=1)

    def predict_proba(self, X: pd.DataFrame) -> np.ndarray[Any, Any]:
        return np.full((len(X), 2), [1 - self.propensity_score, self.propensity_score])
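
As a quick sanity check of the `predict_proba` logic above, the following minimal sketch reproduces the `np.full` call outside of the class. The row count `n_rows` is a made-up value for illustration only; the point is that the two-element fill value is broadcast across all rows, so every row holds the probabilities of the control and the treatment class and sums to one.

```python
import numpy as np

propensity_score = 0.3
n_rows = 4  # hypothetical number of observations, for illustration only

# The fill value [1 - p, p] has shape (2,) and is broadcast to shape (n_rows, 2).
probabilities = np.full((n_rows, 2), [1 - propensity_score, propensity_score])

# Column 0 is the probability of control, column 1 the probability of treatment.
print(probabilities.shape)
print(probabilities.sum(axis=1))
```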

Fitting the MetaLearner
-----------------------

Finally we can instantiate and fit our MetaLearner using our own custom propensity model:

In [None]:
from metalearners import RLearner
from lightgbm import LGBMRegressor

rlearner = RLearner(
    nuisance_model_factory=LGBMRegressor,
    propensity_model_factory=FixedPropensityModel,
    treatment_model_factory=LGBMRegressor,
    nuisance_model_params={"verbose": -1},
    propensity_model_params={"propensity_score": 0.3},
    treatment_model_params={"verbose": -1},
    is_classification=False,
    n_variants=2,
)
rlearner.fit(
    X=df[feature_columns],
    y=df[outcome_column],
    w=df[treatment_column],
)

We can check that the propensity estimates correspond to our expectation:

In [None]:
rlearner.predict_nuisance(
    X=df[feature_columns], model_kind="propensity_model", model_ord=0, is_oos=False
)

Further comments
----------------

* This example shows how we can use the same propensity score for all observations in the
  binary treatment setting. The class could easily be extended to multiple treatment
  variants. Moreover, customizing the propensity score according to some simple rule
  extracted from the input features could easily be accommodated analogously.
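
The extension to multiple treatment variants mentioned above could look roughly like the following sketch. The class name `FixedMultiVariantPropensityModel` and its `propensity_scores` parameter are hypothetical and not part of ``metalearners``. The sketch assumes one known probability per variant, given in the order of the sorted treatment labels, and has only been exercised on toy data, not within a MetaLearner.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin


class FixedMultiVariantPropensityModel(ClassifierMixin, BaseEstimator):
    """Hypothetical extension: one fixed probability per treatment variant."""

    def __init__(self, propensity_scores):
        # Stored unmodified, as sklearn's cloning expects; validated in fit.
        self.propensity_scores = propensity_scores

    def fit(self, X, y):
        self.classes_ = np.unique(np.asarray(y))  # sklearn requires this
        probabilities = np.asarray(self.propensity_scores, dtype=float)
        if len(probabilities) != len(self.classes_):
            raise ValueError("Need exactly one probability per treatment variant.")
        if not np.isclose(probabilities.sum(), 1.0):
            raise ValueError("The probabilities must sum to 1.")
        self.probabilities_ = probabilities
        return self

    def predict_proba(self, X):
        # Repeat the same probability vector for every row of X.
        return np.tile(self.probabilities_, (len(X), 1))

    def predict(self, X):
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]


# Toy data with three variants; the values are made up for illustration.
X_toy = np.zeros((5, 2))
w_toy = np.array([0, 1, 2, 1, 0])
model = FixedMultiVariantPropensityModel([0.5, 0.3, 0.2]).fit(X_toy, w_toy)
proba = model.predict_proba(X_toy)
```

A feature-dependent variant would follow the same pattern, computing the rows of the array returned by `predict_proba` from `X` instead of tiling a constant vector.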