# Regression example with Thetis

Thetis can evaluate AI systems that perform regression tasks. In this example, we demonstrate how to evaluate and rate an AI model using a basic regression example on a synthetic dataset. We utilize a Bayesian Ridge Regression model provided by [scikit-learn](https://scikit-learn.org/). The instructions below should be easy to adapt to your own use case.

## Set up the environment

If you haven't done so already, install Thetis using pip:

```shell
$ pip install thetis
```

For this example, you can use the demo license file [demo_license_regression.dat](https://raw.githubusercontent.com/EFS-OpenSource/Thetis/main/examples/demo_license_regression.dat), located in the same directory as this notebook.
This license only works for our demonstration dataset with the exact configuration provided in this notebook.

**Important**: Do not modify the random seed that we use below to generate the dataset if you are using the demo license, since the demo license is tied to this exact dataset.

Place the license file either in the working directory of your application or at:

- Windows: `<User>/AppData/Local/Thetis/license.dat`
- Unix: `~/.local/thetis/license.dat`
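
If you are unsure whether the license file is in place, a short check along the following lines may help. This is a minimal sketch: the helper `candidate_license_paths` is hypothetical, it assumes the file is named `license.dat` in the working directory as well, and it only mirrors the two locations listed above rather than reproducing Thetis's own lookup logic.

```python
import os
from pathlib import Path


def candidate_license_paths(filename: str = "license.dat") -> list:
    """Return the two locations listed above (hypothetical helper, not part of Thetis)."""
    if os.name == "nt":
        # Windows: <User>/AppData/Local/Thetis/license.dat
        user_path = Path.home() / "AppData" / "Local" / "Thetis" / filename
    else:
        # Unix: ~/.local/thetis/license.dat
        user_path = Path.home() / ".local" / "thetis" / filename
    return [Path.cwd() / filename, user_path]


# report which of the candidate locations currently holds a file
for path in candidate_license_paths():
    status = "found" if path.is_file() else "missing"
    print(f"{path}: {status}")
```

In this notebook the demo license is passed to Thetis explicitly via `license_file_path`, so the check is only relevant if you rely on the default locations.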

## Increase logging verbosity

To obtain detailed runtime information about Thetis, run the following cell. This will add a logging handler to the Thetis logger, increasing the application's verbosity.

In [None]:
import logging
import sys


# Attach a stream handler to the "Thetis" logger so that its messages are printed to stderr
logger = logging.getLogger("Thetis")
logger.setLevel(logging.INFO)
handler = logging.StreamHandler(sys.stderr)
handler.setFormatter(logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s'))
logger.addHandler(handler)

In [None]:
import numpy as np
import pandas as pd

n_samples = 10000
testset_size = 1000

## Prepare example data and model

For this tutorial, we create a synthetic dataset that records a person's net worth as a standardized asset score, along with two generic observable features, `feature_0` and `feature_1`. The regression task is to predict the asset score from these features: `feature_0` has a linear relationship with the target score, and `feature_1` has a non-linear (square-root) relationship.

Our dataset also includes two sensitive features, `gender` and `age`. We artificially introduce bias into this dataset as follows:

- Gender: Males will have a higher asset level compared to females.
- Age: Seniors will have a higher asset level, adults a medium level, and juniors a lower level.

In [None]:
def generate_regression_demo_dataset(n_samples: int):
    """
    Create a dataset with the following properties:

    Sensitive attributes: gender (male, female), age (junior, adult, senior)
    Target: asset (standardized)

    Induce the following biases:
    - gender: males have a higher asset level than females
    - age: seniors have the highest asset level, adults a medium level, and juniors the lowest
    """
    
    # IMPORTANT: keep this seed fixed, otherwise the demonstration license will reject the resulting dataset
    np.random.seed(0)

    # first, generate some data regarding the protected/sensitive attributes "gender" and "age"
    gender = np.random.choice(["male", "female"], replace=True, p=[0.5, 0.5], size=n_samples)
    age = np.random.choice(["junior", "adult", "senior"], replace=True, p=[1./3., 1./3., 1./3.], size=n_samples)

    # second, generate the (standardized) target asset scores, which are drawn from a normal distribution
    assets = np.random.normal(loc=5, scale=2, size=n_samples)

    # induce bias: males have a higher asset level than females
    assets[gender == "male"] = assets[gender == "male"] + np.random.uniform(0.1, 1.0, size=len(assets[gender == "male"]))
    
    # induce bias: juniors have a lower and seniors a higher asset level
    assets[age == "junior"] = assets[age == "junior"] - np.random.uniform(0.1, 1.0, size=len(assets[age == "junior"]))
    assets[age == "senior"] = assets[age == "senior"] + np.random.uniform(0.1, 1.0, size=len(assets[age == "senior"]))
    
    # clip everything to positive values
    np.clip(assets, 1e-4, None, out=assets)
    
    # feature_0 has a linear relationship with target with some Gaussian noise
    feature_0 = assets + np.random.normal(0, 1.0, size=n_samples)
    
    # feature_1 has a non-linear relationship with target with Gaussian noise
    feature_1 = np.sqrt(assets) + np.random.normal(0, 1.0, size=n_samples)
    
    # finally, gather information into pd.DataFrame instances
    features = pd.DataFrame({"feature_0": feature_0, "feature_1": feature_1})
    annotations = pd.DataFrame({"target": assets, "gender": gender, "age": age})

    return features, annotations

Using this function, we now generate our dataset. The function returns two [Pandas](https://pandas.pydata.org/) data frames: one containing the features, and the other containing the annotations, i.e. the target score together with the sensitive attributes `gender` and `age`. We then split the data into training and testing sets, holding out the last `testset_size` samples for testing.

In [None]:
# generate dataset
features, annotations = generate_regression_demo_dataset(n_samples=n_samples)

# split into training and testing data
df_train, df_test = features.iloc[:-testset_size], features.iloc[-testset_size:]
annotations_train, annotations_test = annotations.iloc[:-testset_size], annotations.iloc[-testset_size:]
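
Before training, it can be useful to confirm that the induced bias is visible in the data. The sketch below uses a hypothetical helper, `mean_target_by_group`, and runs it on a tiny hand-made frame with the same column names as the `annotations` frame; the toy values are illustrative only and are not taken from the demo dataset.

```python
import pandas as pd


def mean_target_by_group(annotations: pd.DataFrame, column: str, target: str = "target") -> pd.Series:
    """Average target value for each category of a sensitive attribute (hypothetical helper)."""
    return annotations.groupby(column)[target].mean().sort_values()


# toy frame with the same columns as "annotations"; values are made up for illustration
toy = pd.DataFrame({
    "target": [4.0, 6.0, 5.0, 7.0],
    "gender": ["female", "male", "female", "male"],
    "age": ["junior", "senior", "adult", "senior"],
})

gender_means = mean_target_by_group(toy, "gender")
print(gender_means)
```

In the notebook, calling `mean_target_by_group(annotations_train, "gender")` or `mean_target_by_group(annotations_train, "age")` should show the ordering described above if the bias was induced as intended.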

## Train the regression model

In the next step, we will train a simple Bayesian Ridge Regression model on the training data using scikit-learn. We will then use the trained model to make predictions on the test data.

*Note*: we call `predict()` with `return_std=True` so that scikit-learn also returns the standard deviation of the predictive distribution for each sample. This serves as the uncertainty information for the predictions.

In [None]:
from sklearn.linear_model import BayesianRidge

# initialize model and call "fit()" function
bayesian_ridge = BayesianRidge().fit(
    X=df_train.to_numpy(), 
    y=annotations_train["target"].to_numpy().squeeze()
)

In [None]:
# make predictions on the test set and obtain additional uncertainty information
pred, stddev = bayesian_ridge.predict(df_test.to_numpy(), return_std=True)

# gather prediction information into a single pd.DataFrame
# IMPORTANT: the index must be the same as the index of the "annotations_test" data frame
predictions = pd.DataFrame({"predictions": pred, "stddev": stddev}, index=annotations_test.index)
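
As a quick sanity check before handing the predictions to Thetis, one could compute the root mean squared error and the fraction of targets that fall inside a symmetric interval around each prediction. The sketch below is an assumption-laden illustration: `regression_summary` is a hypothetical helper, the interval assumes a Gaussian predictive distribution (`z=1.96` for a nominal 95 % interval), and these numbers are not the metrics that Thetis itself reports.

```python
import numpy as np


def regression_summary(y_true, y_pred, stddev, z: float = 1.96) -> dict:
    """RMSE and empirical coverage of the interval y_pred +/- z * stddev (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    stddev = np.asarray(stddev, dtype=float)

    errors = y_true - y_pred
    rmse = float(np.sqrt(np.mean(errors ** 2)))
    # fraction of targets that lie within z standard deviations of the prediction
    coverage = float(np.mean(np.abs(errors) <= z * stddev))
    return {"rmse": rmse, "coverage": coverage}


# toy values for illustration only
summary = regression_summary(
    y_true=[1.0, 2.0, 3.0, 4.0],
    y_pred=[1.5, 2.0, 2.0, 4.0],
    stddev=[0.5, 0.5, 0.25, 0.5],
)
print(summary)
```

In the notebook, the equivalent call would be `regression_summary(annotations_test["target"], predictions["predictions"], predictions["stddev"])`. A coverage far from the nominal level hints at over- or under-confident uncertainty estimates.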

## Run Thetis to analyze and evaluate the AI system

You can download the [demo configuration file](https://raw.githubusercontent.com/EFS-OpenSource/Thetis/main/examples/demo_config_regression.yaml) for this example from our repository. For detailed information on Thetis configuration, refer to the [Configuration](https://efs-opensource.github.io/Thetis/configuration.html) section.

In addition to generating the report in PDF format, which we display below, Thetis also returns its findings, final rating, and recommendations for mitigation strategies as a JSON-like dictionary. We capture this dictionary as `result` and access it as follows:

* `result[<aspect>]` contains a sub-dictionary with results for each aspect of the analysis, e.g. 'fairness' or 'uncertainty'.
* `result[<aspect>]['rating_score']` contains the rating as a score from 0 to 10.
* `result[<aspect>]['rating_enum']` contains the rating as a grade, which can be `'GOOD'`, `'MEDIUM'`, or `'BAD'`, depending on the rating score.
* `result[<aspect>]['recommendations']` contains findings regarding possible issues and recommendations for mitigation.
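
The access pattern above can be sketched as a small loop over the aspects. The helper `summarize_result` and the contents of `mock_result` are hypothetical: the scores, grades, and recommendation texts are placeholders, and the exact type of the `'recommendations'` entry is an assumption. Only the key names are taken from the list above.

```python
def summarize_result(result: dict) -> list:
    """Collect (aspect, score, grade) for every aspect that carries a rating (hypothetical helper)."""
    rows = []
    for aspect, content in result.items():
        # skip entries that are not per-aspect sub-dictionaries with a rating
        if not isinstance(content, dict) or "rating_score" not in content:
            continue
        rows.append((aspect, content["rating_score"], content.get("rating_enum")))
    return rows


# placeholder dictionary that mimics the structure described above; values are made up
mock_result = {
    "fairness": {"rating_score": 4.0, "rating_enum": "MEDIUM", "recommendations": ["placeholder"]},
    "uncertainty": {"rating_score": 8.5, "rating_enum": "GOOD", "recommendations": []},
}

for aspect, score, grade in summarize_result(mock_result):
    print(f"{aspect}: {score} ({grade})")
```

Once Thetis has run, passing the real `result` dictionary to `summarize_result` gives a compact overview before opening the PDF report.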

In [None]:
from thetis import thetis


result = thetis(
    config="demo_config_regression.yaml",
    annotations=annotations_test,
    predictions=predictions,
    output_dir="./output",
    license_file_path="demo_license_regression.dat"
)

In [None]:
# show the PDF report within the current Jupyter notebook
from IPython.display import IFrame

IFrame("./output/report.pdf", width=800, height=1024)