# Task 1: Anon

In this example, we apply [Anon](https://github.com/DataResponsibly/Anon) to profile the performance of three models trained on the [Diabetes Dataset 2019](https://www.kaggle.com/datasets/tigganeha4/diabetes-dataset-2019/data). We show how to create the input arguments for Anon and how to compute overall and disparity metrics with its metric computation interface.

This notebook is structured as follows:
* **Step 1**: Create a _config yaml_ for metric computation.
* **Step 2**: Preprocess a dataset and construct a _BaseFlowDataset_ object.
* **Step 3**: Tune models and create a _models config_.
* **Step 4**: Run a metric computation interface from Anon.
* **Step 5**: Compose disparity metrics using _Metric Composer_.

## Install necessary packages and import dependencies

In [None]:
# Install Anon from PyPI. The library supports Python 3.9-3.11.
!pip install anon

In [None]:
# Quote the requirement so the shell does not treat '>' as an output redirect.
!pip install "xgboost>=1.7.2"

In [None]:
import os
import warnings
warnings.filterwarnings('ignore')
os.environ["PYTHONWARNINGS"] = "ignore"

In [None]:
from pprint import pprint
from datetime import datetime, timezone

import pandas as pd

from xgboost import XGBClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler

from anon.utils.custom_initializers import create_config_obj, read_model_metric_dfs
from anon.user_interfaces.multiple_models_api import compute_metrics_with_config
from anon.preprocessing.basic_preprocessing import preprocess_dataset
from anon.custom_classes.metrics_interactive_visualizer import MetricsInteractiveVisualizer
from anon.custom_classes.metrics_visualizer import MetricsVisualizer
from anon.custom_classes.metrics_composer import MetricsComposer
from anon.utils.model_tuning_utils import tune_ML_models

## **Step 1**: Create a _config yaml_ for metrics computation.

Review the dataset.

In [None]:
from anon.datasets import DiabetesDataset2019

data_loader = DiabetesDataset2019(with_nulls=False, subsample_seed=42)
data_loader.full_df.head()

First, we need to create a _config yaml_, which includes the following parameters for metrics computation:

* **dataset_name**: str, a name of the dataset; it will be used to name files with metrics.

* **bootstrap_fraction**: float, the fraction of the train set, in the range [0.0, 1.0], used to fit each model in the bootstrap (usually more than 0.5).

* **random_state**: int, a seed to control the randomness of the whole model evaluation pipeline.

* **n_estimators**: int, the number of estimators for bootstrap to compute stability and uncertainty metrics.

* **sensitive_attributes_dct**: dict, a dictionary where keys are sensitive attribute names (including intersectional attributes), and values are disadvantaged values for these attributes. Intersectional attributes must include '&' between sensitive attributes. You do not need to specify disadvantaged values for intersectional groups since they will be derived from disadvantaged values in sensitive_attributes_dct for each separate sensitive attribute in this intersectional pair.
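
To make the roles of **bootstrap_fraction**, **n_estimators**, and **random_state** more concrete, here is a minimal sketch of how bootstrap samples could be drawn from a train set. It is an illustration only, not Anon's implementation: the helper name `draw_bootstrap_samples` is hypothetical, and sampling with replacement is an assumption borrowed from the classic bootstrap.

```python
import random


def draw_bootstrap_samples(n_train_rows, bootstrap_fraction, n_estimators, random_state):
    """Return one list of row indices per bootstrap estimator (hypothetical helper)."""
    if not 0.0 < bootstrap_fraction <= 1.0:
        raise ValueError("bootstrap_fraction must be in (0.0, 1.0]")
    rng = random.Random(random_state)  # a single seed makes every draw reproducible
    sample_size = int(n_train_rows * bootstrap_fraction)
    # Assumption: rows are drawn with replacement, as in a classic bootstrap.
    return [
        [rng.randrange(n_train_rows) for _ in range(sample_size)]
        for _ in range(n_estimators)
    ]


samples = draw_bootstrap_samples(n_train_rows=1000, bootstrap_fraction=0.8,
                                 n_estimators=20, random_state=42)
print(len(samples), len(samples[0]))  # number of estimators, rows per estimator
```

Each estimator is then fit on its own sample, and the spread of their predictions is what stability and uncertainty metrics summarize. More estimators give a smoother estimate of that spread at the cost of longer run time.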

### **TODO 1**: Define one sensitive group in the *sensitive_attributes_dct* section of the *experiment_config.yaml*. Explain why the selected sensitive attribute is appropriate for this prediction task and its context.

**Hint:** Do NOT use "Age" as a sensitive attribute. Age is a real clinical risk factor for diabetes, so enforcing equal rates across age groups would be misleading.

To decide on `<SENSITIVE_ATTRIBUTE>` and `<DISADVANTAGED_VALUE>`, you may review:
- The **Kaggle Diabetes Dataset 2019** page:  
  https://www.kaggle.com/datasets/tigganeha4/diabetes-dataset-2019/data  
  (this helps you understand the distribution of each column)
- The previously shared **Tutorial Notes**:  
  https://drive.google.com/file/d/1BmXdxEEFiRM19Ny7V3hjEemnLmiFnbVc/view?usp=sharing  
  (for definitions and contextual meaning)

Note that within *sensitive_attributes_dct*:
- the key is the column name in the dataset
- the value is the disadvantaged value within this column; it can be a literal or a list, as in the examples below.

```
sensitive_attributes_dct: {'sex': 'female'}
```

```
sensitive_attributes_dct: {'age': [19, 20, 21, 22, 23, 24, 25]}
```
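
The sketch below illustrates how a literal or a list value could be interpreted, and how an intersectional key could be checked against the separately defined attributes. Both helpers are hypothetical and written only for this explanation; they are not part of Anon. Using `None` as the value of the intersectional key is also an assumption, based on the note above that no disadvantaged value is needed for intersectional groups.

```python
def is_disadvantaged(cell_value, disadvantaged_value):
    """True if a row's value matches the configured disadvantaged value(s)."""
    if isinstance(disadvantaged_value, (list, tuple, set)):
        return cell_value in disadvantaged_value  # list form: membership test
    return cell_value == disadvantaged_value      # literal form: equality test


def check_intersectional_keys(sensitive_attributes_dct):
    """Return intersectional keys whose components are not all defined separately."""
    problems = []
    for key in sensitive_attributes_dct:
        if '&' in key:
            parts = [part.strip() for part in key.split('&')]
            if any(part not in sensitive_attributes_dct for part in parts):
                problems.append(key)
    return problems


# Illustrative config reusing the 'sex' and 'age' examples shown above.
example_dct = {
    'sex': 'female',
    'age': [19, 20, 21, 22, 23, 24, 25],
    'sex&age': None,  # assumption: no value is given for an intersectional key
}

print(is_disadvantaged('female', example_dct['sex']))
print(is_disadvantaged(30, example_dct['age']))
print(check_intersectional_keys(example_dct))
```

An intersectional key such as `'sex&age'` is only meaningful when both `'sex'` and `'age'` also appear as separate keys, which is what the second helper verifies.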


In [None]:
config_yaml_path = os.path.join('.', 'experiment_config.yaml')
config_yaml_content = """
dataset_name: Diabetes_2019
bootstrap_fraction: 0.8
random_state: 42
n_estimators: 20  # A value above 100 is preferable; 20 is used only for this user study
sensitive_attributes_dct: {'<SENSITIVE_ATTRIBUTE>': '<DISADVANTAGED_VALUE>'}
"""

with open(config_yaml_path, 'w', encoding='utf-8') as f:
    f.write(config_yaml_content)

In [None]:
config = create_config_obj(config_yaml_path=config_yaml_path)
SAVE_RESULTS_DIR_PATH = os.path.join('.', 'results', f'{config.dataset_name}_Metrics_{datetime.now(timezone.utc).strftime("%Y%m%d__%H%M%S")}')

***Briefly write your answer in the Google form***

## **Step 2**: Preprocess the dataset and construct a _BaseFlowDataset_ object.

Second, we need to preprocess the dataset.

Define preprocessing steps and initialize a column transformer.

In [None]:
column_transformer = ColumnTransformer(transformers=[
    ('categorical_features', OneHotEncoder(handle_unknown='ignore', sparse_output=False), data_loader.categorical_columns),
    ('numerical_features', StandardScaler(), data_loader.numerical_columns),
])

Construct a BaseFlowDataset object.

In [None]:
DATASET_SPLIT_SEED = 42
MODELS_TUNING_SEED = 42
TEST_SET_FRACTION = 0.2

base_flow_dataset = preprocess_dataset(data_loader=data_loader,
                                       column_transformer=column_transformer,
                                       sensitive_attributes_dct=config.sensitive_attributes_dct,
                                       test_set_fraction=TEST_SET_FRACTION,
                                       dataset_split_seed=DATASET_SPLIT_SEED)

## **Step 3**: Tune models and create a _models config_.

Next, we construct a _models config_ that holds the initialized models you want to profile with Anon. The models should first be tuned, either with the _tune_ML_models()_ function from Anon or in any other convenient way.

Define models and hyper-parameters to tune using GridSearchCV from sklearn.

In [None]:
models_params_for_tuning = {
    'LogisticRegression': {
        'model': LogisticRegression(random_state=MODELS_TUNING_SEED),
        'params': {
            'penalty': ['l2'],
            'C' : [0.0001, 0.1, 1, 100],
            'solver': ['newton-cg', 'lbfgs'],
            'max_iter': [250],
        }
    },
    'KNeighborsClassifier': {
        'model': KNeighborsClassifier(),
        'params': {
            'weights' : ['uniform'],
            'algorithm' : ['auto'],
            'n_neighbors' : [3, 4, 5],
        }
    },
    'XGBClassifier': {
        'model': XGBClassifier(random_state=MODELS_TUNING_SEED, verbosity=0),
        'params': {
            'learning_rate': [0.1],
            'n_estimators': [50],
            'max_depth': [5, 7],
            'lambda':  [10, 100]
        }
    }
}
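
To get a feel for the cost of a grid search, the following sketch enumerates the candidate settings for the LogisticRegression grid above using only the standard library. It mirrors how an exhaustive grid is expanded (one candidate per combination of values); the variable names are illustrative, and `n_folds = 3` is taken from the tuning call used in this notebook.

```python
from itertools import product

# Same grid as the 'LogisticRegression' entry above.
logreg_grid = {
    'penalty': ['l2'],
    'C': [0.0001, 0.1, 1, 100],
    'solver': ['newton-cg', 'lbfgs'],
    'max_iter': [250],
}

# One candidate per element of the Cartesian product of all value lists.
candidates = [
    dict(zip(logreg_grid, values))
    for values in product(*logreg_grid.values())
]

n_folds = 3
print(len(candidates))            # number of hyper-parameter combinations
print(len(candidates) * n_folds)  # number of cross-validation fits
```

Every value added to a list multiplies the number of fits, which is why the grids in this notebook are kept small.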

Tune models using the _tune_ML_models()_ function from Anon and create the _models config_.

In [None]:
tuned_params_df, models_config = tune_ML_models(models_params_for_tuning, base_flow_dataset, dataset_name='Diabetes_2019', n_folds=3)
tuned_params_df

In [None]:
pprint(models_config)

### **TODO 2:** Based on these accuracy metrics, which model would you choose for this classification task? Can you say anything about who may be disadvantaged and in what way when the model makes an error?

***Briefly write your answer in the Google form***

## **Step 4**: Run a metric computation interface from Anon.

After that, we pass the _BaseFlowDataset_ object, the _models config_, and the _config yaml_ to the metric computation interface and execute it. The interface uses subgroup analyzers to compute different sets of metrics for each privileged and disadvantaged group. When the variance and error analyzers complete metric computation, their metrics are combined, returned in a matrix format, and stored in a file if a results path is defined.
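
As a rough illustration of what a group-wise error analysis involves, the sketch below computes confusion-matrix rates overall and separately for a privileged and a disadvantaged group. The function names and the toy labels are made up for this example; this is not Anon's code, only the standard definitions of TPR, TNR, FPR, FNR, and PPV applied per group.

```python
def confusion_rates(y_true, y_pred):
    """Standard confusion-matrix rates for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)

    def ratio(num, den):
        return num / den if den else float('nan')

    return {
        'TPR': ratio(tp, tp + fn),
        'TNR': ratio(tn, tn + fp),
        'FPR': ratio(fp, fp + tn),
        'FNR': ratio(fn, fn + tp),
        'PPV': ratio(tp, tp + fp),
    }


def rates_by_group(y_true, y_pred, is_dis):
    """Compute the rates overall and for the privileged / disadvantaged rows."""
    def subset(flag):
        rows = [(t, p) for t, p, d in zip(y_true, y_pred, is_dis) if d == flag]
        return [t for t, _ in rows], [p for _, p in rows]

    return {
        'overall': confusion_rates(y_true, y_pred),
        'priv': confusion_rates(*subset(False)),
        'dis': confusion_rates(*subset(True)),
    }


# Toy data, invented for illustration only.
y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 0, 0]
is_dis = [False, False, False, False, True, True, True, True]

group_rates = rates_by_group(y_true, y_pred, is_dis)
for group, rates in group_rates.items():
    print(group, rates)
```

Comparing the `priv` and `dis` rows of such a table is the basis for the questions that follow: a model can have the same overall accuracy while distributing its false positives and false negatives unevenly across groups.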

In [None]:
metrics_dct = compute_metrics_with_config(base_flow_dataset, config, models_config, SAVE_RESULTS_DIR_PATH, notebook_logs_stdout=True)

### **TODO 3:** Fill in the index of your chosen model from the table in Step 3 below, then run the cells to see more comprehensive metrics for the protected classes in the dataset.

In [None]:
model_index =

View the computed metrics for one model.

In [None]:
sample_model_metrics_df = metrics_dct[list(models_config.keys())[model_index]]
error_rate_metrics = ['TPR', 'TNR', 'PPV', 'FPR', 'FNR']
sample_model_metrics_df[sample_model_metrics_df['Metric'].isin(error_rate_metrics)]

### **TODO 4:** Consider your chosen model's error rates (FPR, FNR) for the protected classes. Does your model further disadvantage the already disadvantaged group? (Note that `_priv` denotes the privileged group and `_dis` denotes the disadvantaged group for each protected class.)

***Briefly write your answer in the Google form***

### **TODO 5:** Select any additional model. Fill in model names to compare additional metrics for protected classes against those of your original model.

***Briefly write your answer to the following question in the Google form:***

* **Would you change your original choice? Why or why not?**

In [None]:
chosen_model, additional_model = '<ORIGINALLY_CHOSEN_MODEL_NAME>', '<ADDITIONAL_MODEL_NAME>'

df_concat = pd.concat([metrics_dct[chosen_model], metrics_dct[additional_model]], keys=[chosen_model, additional_model])
df_concat = df_concat.reset_index()

metrics = ['Accuracy', 'F1', 'TPR', 'TNR', 'FPR', 'FNR']
sensitive_attr = list(config.sensitive_attributes_dct.keys())[0]
df_pivot = df_concat.pivot(
    index='Metric',
    columns='Model_Name',
    values=['overall', f'{sensitive_attr}_priv', f'{sensitive_attr}_dis']
)
df_pivot.loc[metrics]

## **Step 5**: Compose disparity metrics using _Metric Composer_.

To compose disparity metrics, the _Metric Composer_ should be applied.

Read the computed metrics from a file created by the metric computation interface.

In [None]:
models_metrics_dct = read_model_metric_dfs(SAVE_RESULTS_DIR_PATH, model_names=list(models_config.keys()))

Compose disparity metrics using _MetricsComposer_.

In [None]:
metrics_composer = MetricsComposer(models_metrics_dct, config.sensitive_attributes_dct)
models_composed_metrics_df = metrics_composer.compose_metrics()

### **TODO 6:** Fill in the *model_name* variable with the model you chose in TODO 4 and calculate fairness metrics. Then answer: Does this model demonstrate satisfactory fairness? Explain why or why not.

*Hint:* Values closer to zero are better.
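
To see why values near zero are desirable, consider the sketch below. It assumes that each disparity metric is the disadvantaged group's value minus the privileged group's value; that sign convention, the helper name `compose_disparities`, and the numbers are assumptions made for illustration, not a description of how _MetricsComposer_ is implemented.

```python
def compose_disparities(priv, dis):
    """Difference-based disparities (assumed convention: dis minus priv)."""
    return {
        'Accuracy_Difference': dis['Accuracy'] - priv['Accuracy'],
        'Equalized_Odds_FPR': dis['FPR'] - priv['FPR'],
        'Equalized_Odds_FNR': dis['FNR'] - priv['FNR'],
    }


# Invented group metrics, for illustration only.
priv_metrics = {'Accuracy': 0.90, 'FPR': 0.10, 'FNR': 0.20}
dis_metrics = {'Accuracy': 0.85, 'FPR': 0.10, 'FNR': 0.30}

disparities = compose_disparities(priv_metrics, dis_metrics)
for name, value in disparities.items():
    print(f'{name}: {value:+.2f}')
```

Under this convention, a value of zero means both groups are treated alike on that metric, while a positive `Equalized_Odds_FNR` means the disadvantaged group is missed more often than the privileged one.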

In [None]:
# models_composed_metrics_df
model_name = ''

models_composed_metrics_df = models_composed_metrics_df[models_composed_metrics_df['Model_Name'] == model_name]
fairness_metrics = ['Accuracy_Difference', 'Equalized_Odds_FPR', 'Equalized_Odds_FNR']
models_composed_metrics_df[models_composed_metrics_df['Metric'].isin(fairness_metrics)]

***Briefly write your answer in the Google form***