# Regression Example with Thetis

Thetis can evaluate the AI safety of regression models.
This notebook demonstrates how to evaluate and rate an AI model using a basic regression example on a custom demo data set. We use a Bayesian Ridge Regression model provided by [scikit-learn](https://scikit-learn.org/).
The instructions below should be easy to adapt to your own use case.

## Set Up the Environment

First, install Thetis with pip:

```shell
$ pip install thetis
```
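
To confirm from within Python that the installation succeeded, the following minimal sketch queries the installed package metadata using the standard library. It assumes the distribution name matches the `pip install` command above; `installed_version` is a hypothetical helper, not part of Thetis.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Assumption: the distribution is named "thetis", as in the pip command above.
version = installed_version("thetis")
print(f"thetis version: {version}" if version else "thetis is not installed")
```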

Next, you need to obtain a license in order to use Thetis.

For the current example, you can use the *demo license* located within the same directory as this notebook.
This license only works for our demonstration data set with the exact configuration provided in this notebook.
Use the license file [demo_license_regression.dat](https://raw.githubusercontent.com/EFS-OpenSource/Thetis/main/examples/demo_license_regression.dat).

**Important**: Do not modify the random seed! Since the license is only valid for the actual demonstration case, it is necessary to keep the random seed fixed in order to make the demonstration work.

A customized *full license*, enabling you to run Thetis with your own data sets and settings, is available at our [Subscription Page](https://efs-opensource.github.io/Thetis/subscription.html).

Place the license file either in the working directory of your application or at:

- Windows: `<User>/AppData/Local/Thetis/license.dat`
- Unix: `~/.local/thetis/license.dat`
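
Before running Thetis, it can help to check whether a license file is present at one of the locations listed above. The sketch below is a pre-flight check only and does not reproduce Thetis's own lookup logic. It assumes that `<User>` refers to the current user's home directory and that the file in the working directory is also named `license.dat`; both helper functions are hypothetical.

```python
import sys
from pathlib import Path
from typing import List, Optional


def candidate_license_paths(working_dir: Optional[Path] = None) -> List[Path]:
    """Return the documented license locations, working directory first."""
    working_dir = Path.cwd() if working_dir is None else Path(working_dir)
    if sys.platform.startswith("win"):
        # Assumption: "<User>" is the current user's home directory.
        user_path = Path.home() / "AppData" / "Local" / "Thetis" / "license.dat"
    else:
        user_path = Path.home() / ".local" / "thetis" / "license.dat"
    # Assumption: the file name in the working directory is "license.dat".
    return [working_dir / "license.dat", user_path]


def find_license(working_dir: Optional[Path] = None) -> Optional[Path]:
    """Return the first candidate path that exists on disk, or None."""
    for path in candidate_license_paths(working_dir):
        if path.is_file():
            return path
    return None


print(find_license())
```

For the demo in this notebook, the license file is passed explicitly to Thetis, so this check is only relevant when relying on the default locations.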

## Increase Logging Verbosity

To get detailed runtime information from Thetis, run the following cell. It attaches a logging handler to the Thetis logger and sets its level to `INFO`.

In [None]:
import logging
import sys


# Configure the "Thetis" logger to print INFO-level messages to stderr
logger = logging.getLogger("Thetis")
logger.setLevel(logging.INFO)
handler = logging.StreamHandler(sys.stderr)
handler.setFormatter(logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s'))
logger.addHandler(handler)
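
In a notebook, re-running the cell above attaches another handler each time, so every message is printed more than once. A possible variant that is safe to run repeatedly is sketched below. `enable_thetis_logging` is a hypothetical helper; only the logger name `"Thetis"` and the format string are taken from the cell above.

```python
import logging
import sys


def enable_thetis_logging(level: int = logging.INFO) -> logging.Logger:
    """Attach a single stderr handler to the "Thetis" logger, even if called repeatedly."""
    logger = logging.getLogger("Thetis")
    logger.setLevel(level)
    # Only add a handler if no stream handler is attached yet.
    if not any(isinstance(h, logging.StreamHandler) for h in logger.handlers):
        handler = logging.StreamHandler(sys.stderr)
        handler.setFormatter(
            logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
        )
        logger.addHandler(handler)
    return logger


logger = enable_thetis_logging()
logger = enable_thetis_logging()  # the second call does not add a duplicate handler
```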

In [None]:
import numpy as np
import pandas as pd

n_samples = 10000
testset_size = 1000

## Prepare Example Data and Model

To start with our basic regression example, we need to generate some data. In this tutorial, we use a custom demo data set with the task of asset estimation (standardized) based on two features, one with a linear and one with a non-linear relationship to the target score.

We artificially induce some biases into this data set:
- "gender": males have a higher asset level than females.
- "age": seniors have a higher asset level, adults a medium one, and juniors a lower one.

In [None]:
def generate_regression_demo_dataset(n_samples: int):
    """
    Create data set with following properties:

    Sensitive attributes: gender (male, female), age (junior, adult, senior)
    Target: asset (standardized)

    Induce the following biases:
    - gender: male has a higher asset level compared to female
    - age: senior has the highest asset level, adult a medium one and junior the lowest
    """
    
    # IMPORTANT: keep this seed fixed, otherwise the demonstration license won't work
    np.random.seed(0)

    # first, generate some data regarding the protected/sensitive attributes "gender" and "age"
    gender = np.random.choice(["male", "female"], replace=True, p=[0.5, 0.5], size=n_samples)
    age = np.random.choice(["junior", "adult", "senior"], replace=True, p=[1./3., 1./3., 1./3.], size=n_samples)

    # second, generate the (standardized) target asset scores, which are drawn from a normal distribution
    assets = np.random.normal(loc=5, scale=2, size=n_samples)

    # induce bias where males have a higher asset level than females
    assets[gender == "male"] = assets[gender == "male"] + np.random.uniform(0.1, 1.0, size=len(assets[gender == "male"]))
    
    # induce bias where junior have a lower and senior a higher asset level
    assets[age == "junior"] = assets[age == "junior"] - np.random.uniform(0.1, 1.0, size=len(assets[age == "junior"]))
    assets[age == "senior"] = assets[age == "senior"] + np.random.uniform(0.1, 1.0, size=len(assets[age == "senior"]))
    
    # clip everything to positive values
    np.clip(assets, 1e-4, None, out=assets)
    
    # feature_0 has a linear relationship with target with some Gaussian noise
    feature_0 = assets + np.random.normal(0, 1.0, size=n_samples)
    
    # feature_1 has a non-linear relationship with target with Gaussian noise
    feature_1 = np.sqrt(assets) + np.random.normal(0, 1.0, size=n_samples)
    
    # finally, gather information into pd.DataFrame instances
    features = pd.DataFrame({"feature_0": feature_0, "feature_1": feature_1})
    annotations = pd.DataFrame({"target": assets, "gender": gender, "age": age})

    return features, annotations

With this function, we are able to generate our demo data set.
This yields two [Pandas](https://pandas.pydata.org/) data frames with information about the features and the target annotations. We further split this data into training and testing sets.

In [None]:
# generate data set
features, annotations = generate_regression_demo_dataset(n_samples=n_samples)

# split into training and testing data
df_train, df_test = features.iloc[:-testset_size], features.iloc[-testset_size:]
annotations_train, annotations_test = annotations.iloc[:-testset_size], annotations.iloc[-testset_size:]
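
To see whether the induced biases actually show up in the data, one option is to compare the mean target value per group. The sketch below uses a small toy frame with the same column names as `annotations`; the values are made up for illustration, and `mean_target_by_group` is a hypothetical helper. The same function could be applied to `annotations_train` or `annotations_test`.

```python
import pandas as pd


def mean_target_by_group(annotations: pd.DataFrame, attribute: str) -> pd.Series:
    """Average target value per category of a sensitive attribute."""
    return annotations.groupby(attribute)["target"].mean()


# Toy frame with the same columns as the demo annotations (illustrative values only).
toy = pd.DataFrame({
    "target": [6.0, 4.0, 5.0, 7.0],
    "gender": ["male", "female", "female", "male"],
    "age": ["senior", "junior", "adult", "senior"],
})

by_gender = mean_target_by_group(toy, "gender")
by_age = mean_target_by_group(toy, "age")
print(by_gender)
print(by_age)
```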

## Train Regression Model

In the next step, we train a simple Bayesian Ridge Regression model on the training data using scikit-learn.
Furthermore, we make predictions on the test data using the trained model.

*Note*: we pass `return_std=True` to `predict()` to also obtain uncertainty information about the predictions.

In [None]:
from sklearn.linear_model import BayesianRidge

# initialize model and call "fit()" function
bayesian_ridge = BayesianRidge().fit(
    X=df_train.to_numpy(), 
    y=annotations_train["target"].to_numpy().squeeze()
)

In [None]:
# make predictions on the test set and obtain additional uncertainty information
pred, stddev = bayesian_ridge.predict(df_test.to_numpy(), return_std=True)

# gather prediction information into a single pd.DataFrame
# IMPORTANT: the index must be the same as the index of the "annotations" data frame
predictions = pd.DataFrame({"predictions": pred, "stddev": stddev}, index=annotations_test.index)
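
As a quick plausibility check of the predicted standard deviations, one can measure how often the true target falls inside a symmetric interval around the prediction. The sketch below assumes a Gaussian predictive distribution, so `z = 1.96` corresponds to a nominal 95% interval. It is not how Thetis evaluates uncertainty, and `interval_coverage` is a hypothetical helper. The toy frames stand in for `annotations_test` and `predictions` and use made-up values.

```python
import numpy as np
import pandas as pd


def interval_coverage(annotations: pd.DataFrame, predictions: pd.DataFrame, z: float = 1.96) -> float:
    """Fraction of targets inside the interval prediction +/- z * stddev."""
    # Both frames must describe the same samples in the same order.
    if not predictions.index.equals(annotations.index):
        raise ValueError("predictions and annotations must share the same index")

    target = annotations["target"].to_numpy()
    mean = predictions["predictions"].to_numpy()
    half_width = z * predictions["stddev"].to_numpy()
    inside = np.abs(target - mean) <= half_width
    return float(inside.mean())


# Toy frames with the same column names as in this notebook (illustrative values only).
toy_annotations = pd.DataFrame({"target": [1.0, 2.0, 3.0, 10.0]}, index=[7, 8, 9, 10])
toy_predictions = pd.DataFrame(
    {"predictions": [1.1, 2.5, 2.0, 4.0], "stddev": [0.5, 0.5, 1.0, 1.0]},
    index=toy_annotations.index,
)

coverage = interval_coverage(toy_annotations, toy_predictions)
print(f"empirical coverage: {coverage:.2f}")
```

A coverage far below the nominal level suggests the model is overconfident, while a coverage far above it suggests the intervals are wider than necessary.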

## Run AI Safety Evaluation with Thetis

For details on configuring Thetis, see the [Configuration](https://efs-opensource.github.io/Thetis/configuration.html) section of the documentation. You can download the [demo configuration file](https://raw.githubusercontent.com/EFS-OpenSource/Thetis/main/examples/demo_config_regression.yaml) for the current example from this repository or from [here](https://thetishostedfiles.blob.core.windows.net/demofiles/thetis_demo_regression.zip).

Thetis returns its findings, the final rating and recommendations for mitigation strategies as a JSON-like dictionary. Below, we capture the dictionary as `result` and can access the different evaluation aspects:

* `result[<task>]['rating_score']` for the rating score of the selected task (e.g., 'fairness' or 'uncertainty').
* `result[<task>]['recommendations']` for the recommendations to mitigate possible issues of the selected task.
* `result[<task>]['rating_enum']` for a categorization of the selected task as `'GOOD'`, `'MEDIUM'`,
  or `'BAD'`, depending on the rating score.

In [None]:
from thetis import thetis


result = thetis(
   config="demo_config_regression.yaml",
   annotations=annotations_test,
   predictions=predictions,
   output_dir="./output",
   license_file_path="demo_license_regression.dat"
)
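
The access pattern described above can be wrapped in a small helper that prints one summary line per task. The sketch below runs on a mock dictionary, because the exact contents of the real `result` depend on the configuration. The keys `rating_score`, `rating_enum` and `recommendations` and the task names are taken from the list above; the values, the type of the recommendations entry, and the helper `summarize_result` are assumptions made for illustration.

```python
from typing import Any, Dict, List


def summarize_result(result: Dict[str, Any], tasks: List[str]) -> List[str]:
    """Build one summary line per task from a Thetis-style result dictionary."""
    lines = []
    for task in tasks:
        entry = result.get(task)
        if entry is None:
            # The task is not present in the result dictionary.
            lines.append(f"{task}: not evaluated")
            continue
        lines.append(f"{task}: {entry['rating_enum']} (score {entry['rating_score']})")
    return lines


# Mock stand-in for the dictionary returned by thetis(); all values are placeholders.
mock_result = {
    "fairness": {"rating_score": 1, "rating_enum": "MEDIUM", "recommendations": ["placeholder"]},
    "uncertainty": {"rating_score": 2, "rating_enum": "GOOD", "recommendations": []},
}

summary = summarize_result(mock_result, ["fairness", "uncertainty", "robustness"])
for line in summary:
    print(line)
```

With the real output, `summarize_result(result, ["fairness", "uncertainty"])` would follow the same pattern, provided those tasks are enabled in the configuration.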

In [None]:
# show the PDF report within the current Jupyter notebook
from IPython.display import IFrame

IFrame("./output/report.pdf", width=800, height=1024)