# Ordinal Archetypal Analysis

## Contents

- [Introduction](#introduction)
- [Configuration](#configuration)
- [Synthetic experiments](#synthetic-experiments)
- [ESS8 experiments](#ess8-experiments)

## 👋 Introduction 👋
This is a demo aimed at introducing the codebase and reproducing the main results presented in the paper *"Ordinal Archetypal Analysis for Modelling Human Response Bias"*.

[Archetypal Analysis](https://digitalassets.lib.berkeley.edu/sdtr/ucb/text/379.pdf) can be expressed in the following form:

\begin{equation} 
\begin{array}{clll}
\min _{\mathbf{C}, \mathbf{S}} & L(\mathbf{X},\mathbf{R})&
\text { s.t. } & \mathbf{R} = \mathbf{XCS} \\
&&&
c_{j, k} \geq 0, \quad s_{k, j} \geq 0 \\
&&& \sum_j c_{j,k} = 1, \quad \sum_k s_{k,j} = 1,
\end{array}
\end{equation}
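
As a minimal numerical sketch of the constraints above (not the repository's implementation), the snippet below builds column-stochastic $\mathbf{C}$ and $\mathbf{S}$ with a softmax, forms $\mathbf{R} = \mathbf{XCS}$, and evaluates a squared-error loss. The shapes (attributes by subjects for $\mathbf{X}$), the softmax parameterisation, and the choice of $L$ are assumptions made for illustration.

```python
import numpy as np


def softmax_columns(A: np.ndarray) -> np.ndarray:
    """Map an unconstrained matrix to one whose columns are non-negative and sum to 1."""
    A = A - A.max(axis=0, keepdims=True)  # stabilise the exponentials
    E = np.exp(A)
    return E / E.sum(axis=0, keepdims=True)


rng = np.random.default_rng(0)
M, N, K = 20, 1000, 3  # assumed layout: M attributes, N subjects, K archetypes

X = rng.normal(size=(M, N))                   # toy data, one column per subject
C = softmax_columns(rng.normal(size=(N, K)))  # sum_j c_{j,k} = 1
S = softmax_columns(rng.normal(size=(K, N)))  # sum_k s_{k,j} = 1

R = X @ C @ S                        # reconstruction R = XCS
loss = float(np.sum((X - R) ** 2))   # assumed choice of L(X, R): squared error
print(R.shape, loss)
```

Here the columns of `X @ C` play the role of the archetypes (convex combinations of subjects), and each column of `S` mixes those archetypes to reconstruct one subject.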

In the paper, we show that this formulation can be tailored specifically to ordinal data, and we extend the model to include subject-specific response bias. We showcase the usefulness of the model on synthetic data with and without response bias, and then test it on real-world data, namely human values data from the European Social Survey (ESS8).

The codebase consists of several classes that implement the methods and produce and visualize the results. The entry point for running the code is the $\texttt{main.py}$ file. It is set up to use argparse and supports both running experiments and visualizing results. Analyses are saved to the ./results/ folder and plots to the ./figures/ folder.

[Fernández et al.](https://www.sciencedirect.com/science/article/pii/S002002552100791X) implemented the TSAA method in R. In this method, the data is first preprocessed to a continuous scale and then a conventional Archetypal Analysis is performed. Because the preprocessing is done in R, the path to its output file needs to be specified in order to run the TSAA analyses.
For the synthetic experiments, we provide the outputs of the preprocessing step directly. For the ESS8 dataset, we provide the ordinal mapping of Likert scale points that the preprocessing found after convergence. To use the latter, one has to download the ESS8 dataset and, as an additional preprocessing step, map each entry to the continuous Likert scale provided.
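
The mapping step for ESS8 can be sketched as a simple lookup. This is only an illustration: the numeric values below are placeholders rather than the provided mapping, the number of scale points is assumed, and the sketch assumes a single mapping shared by all questions. The names `likert_to_continuous` and `map_to_continuous` are hypothetical.

```python
import numpy as np

# Placeholder values for illustration only: position i holds the continuous
# value assigned to ordinal response i + 1. Replace them with the provided mapping.
likert_to_continuous = np.array([0.05, 0.22, 0.41, 0.60, 0.80, 0.95])


def map_to_continuous(X_ordinal: np.ndarray, mapping: np.ndarray) -> np.ndarray:
    """Replace each ordinal response (1..len(mapping)) with its continuous value."""
    X_ordinal = np.asarray(X_ordinal)
    if X_ordinal.min() < 1 or X_ordinal.max() > len(mapping):
        raise ValueError("responses must lie in 1..len(mapping)")
    return mapping[X_ordinal - 1]  # shift to 0-based indices, then look up


X_ordinal = np.array([[1, 6, 3],
                      [2, 2, 5]])  # toy responses, one row per subject
X_continuous = map_to_continuous(X_ordinal, likert_to_continuous)
print(X_continuous)
```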


The source code is found in the src folder. We advise that you use the argparse setup when running experiments. If, however, you wish to circumvent it, we recommend looking in src/methods, particularly at the $\texttt{self.compute\_archetypes}$ method of each class, along with the $\texttt{ResultMaker}$ class.

## 🔨 Configuration 🔨

To keep hyperparameters and paths manageable, .yaml files are used for configuring experiments. Some of the parameters are covered in this demo; please consult $\texttt{config.yaml}$ and its inline comments for an exhaustive overview. To keep the number of configuration files manageable, parts of the config are populated via argparse in $\texttt{main.py}$.
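
How command-line arguments might populate a loaded config can be sketched as follows. This is an assumption about the general pattern, not a copy of what $\texttt{main.py}$ does: `build_parser` and `populate` are hypothetical helpers, the defaults are invented, and only the flag names and config keys are taken from this demo.

```python
import argparse
import copy

# Stand-in for a loaded YAML config. None marks values that are assumed to be
# filled in from the command line.
base_cfg = {
    "data": {
        "synthetic_data_params": {"N": 1000, "M": None, "K": 3, "rb": None, "p": 5},
        "results": {"checkpoint_dir": "results/.."},
    }
}


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser exposing a few of the flags used in this demo."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--M", type=int, default=20)   # number of questions
    parser.add_argument("--rb", action="store_true")   # include response bias
    parser.add_argument("--save-folder", required=True)
    return parser


def populate(cfg: dict, args: argparse.Namespace) -> dict:
    """Return a copy of cfg with the CLI values written into it."""
    cfg = copy.deepcopy(cfg)  # leave the loaded config untouched
    params = cfg["data"]["synthetic_data_params"]
    params["M"] = args.M
    params["rb"] = args.rb
    cfg["data"]["results"]["checkpoint_dir"] = f"results/{args.save_folder}"
    return cfg


args = build_parser().parse_args(["--save-folder", "synthetic_Q20_RB", "--M", "20", "--rb"])
cfg = populate(base_cfg, args)
print(cfg["data"])
```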

If you are ever in doubt about which parameters were used during an experiment, refer to the "experiment_config.json" file found under the --save-folder specified when running $\texttt{main.py}$.
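
A small helper for inspecting that file might look like the sketch below. The helper name `load_experiment_config` is hypothetical, and the assumption that the save folder lives directly under `results/` should be checked against your own setup.

```python
import json
from pathlib import Path


def load_experiment_config(save_folder: str, results_root: str = "results") -> dict:
    """Read experiment_config.json from <results_root>/<save_folder> (assumed layout)."""
    path = Path(results_root) / save_folder / "experiment_config.json"
    with path.open(encoding="utf-8") as f:
        return json.load(f)
```

After an analysis has finished, `load_experiment_config("synthetic_Q20_RB")` would return the stored configuration as a dictionary, provided the file sits at the assumed location.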

In [3]:
def print_data_params(cfg: dict) -> None:
    """Print where the questionnaire data comes from and where results are saved."""
    data_cfg = cfg['data']
    if data_cfg['use_synthetic_data']:
        print("\nQuestionnaire data will be synthetic with parameters:")
        for k, v in data_cfg['synthetic_data_params'].items():
            print(f"{k}: {v}")
    else:
        print(f"Questionnaire data will be loaded from {data_cfg['input_data_path']}")

    if data_cfg['do_corrupt']:
        print(f"Data will be corrupted with probability {data_cfg['p_corrupt']}")

    print("\nresults will be saved to: ", data_cfg['results']['checkpoint_dir'])


def print_hyperparams(cfg: dict) -> None:
    """Print the hyperparameter grid; every combination of values is tested."""
    print("hyperparameters for which all combinations will be tested:\n")
    for k, v in cfg['training']['parameter_tuning'].items():
        print(f'{k}: {v}')
    print("")

## 🔬 Synthetic experiments 🔬

To test the model, we generated synthetic data and ran two main experiments:

- 1000 subjects and 20 questions
- 1000 subjects and 100 questions

Furthermore, we corrupted 1% of the data to examine our model's susceptibility to noise, as described in the paper.
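
To make the idea of corrupting 1% of the data concrete, here is a minimal sketch. The exact corruption procedure is the one described in the paper; the scheme below (each entry is replaced, independently with probability `p_corrupt`, by a uniformly drawn scale point) is an assumption for illustration, and the function name `corrupt` is hypothetical.

```python
import numpy as np


def corrupt(X, p_corrupt: float, n_points: int, seed: int = 0):
    """Replace entries, independently with probability p_corrupt, by random scale points.

    Returns the corrupted array and the boolean mask of selected entries.
    A selected entry may by chance receive its original value again.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    mask = rng.random(X.shape) < p_corrupt               # entries selected for corruption
    noise = rng.integers(1, n_points + 1, size=X.shape)  # uniform over 1..n_points
    return np.where(mask, noise, X), mask


data_rng = np.random.default_rng(1)
X = data_rng.integers(1, 6, size=(1000, 20))  # toy data: 1000 subjects, 20 questions, 5 points
X_corrupted, mask = corrupt(X, p_corrupt=0.01, n_points=5)
print(f"fraction selected: {mask.mean():.4f}")
```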


The following code shows the key parameters of the experiment. 

In [4]:
from src.misc.read_config import load_config
config_path = 'configs/config.yaml'
synthetic_cfg = load_config('configs/synthetic_config.yaml')
# synthetic_cfg = set_params(synthetic_cfg, synthetic=True, M=20, rb=True, checkpoint_dir='results/synthetic_Q20_RB')
print_data_params(synthetic_cfg)


Questionnaire data will be synthetic with parameters:
a_param: [1]
b_param: [1.5]
sigma: [-9.21]
sigma_dev: [1e-06]
N: 1000
M: None
K: 3
rb: None
p: 5

results will be saved to:  results/..


Here are some key analysis parameters in the current configuration:

In [5]:
print_hyperparams(synthetic_cfg)

hyperparameters for which all combinations will be tested:

method: ['CAA', 'TSAA', 'OAA', 'RBOAA']
with_init: [True]
beta_regulators: [True]
alternating: [False]
n_archetypes: [2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50]
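
Assuming "all combinations" means the Cartesian product of the lists printed above, the resulting set of runs can be enumerated as in this sketch. The helper `expand_grid` is hypothetical and is not taken from the repository.

```python
from itertools import product

parameter_tuning = {
    "method": ["CAA", "TSAA", "OAA", "RBOAA"],
    "with_init": [True],
    "beta_regulators": [True],
    "alternating": [False],
    "n_archetypes": [2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50],
}


def expand_grid(grid: dict) -> list[dict]:
    """Return one dict per combination of the listed values."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


runs = expand_grid(parameter_tuning)
print(len(runs))  # 4 methods x 13 archetype counts = 52
print(runs[0])
```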



The command below runs the analyses with the above hyperparameters:

In [6]:
! python main.py analyse --config-path configs/synthetic_config.yaml --save-folder synthetic_Q20_RB --M 20 --rb --OSM-path data/synthetic/X_Q20_RB_OSM.csv

^C



Conventional Arhcetypal Analysis in progress...

|--------------------|   0.76% finished // 6.2 minutes remaining
|--------------------|   1.52% finished // 3.2 minutes remaining
|--------------------|   2.28% finished // 2.2 minutes remaining
|--------------------|   3.04% finished // 1.8 minutes remaining
|--------------------|   3.8% finished // 1.5 minutes remaining
|--------------------|   4.56% finished // 1.3 minutes remaining
|██------------------|   5.32% finished // 1.1 minutes remaining
|██------------------|   6.08% finished // 1.0 minutes remaining
|██------------------|   6.84% finished // 57.2 seconds remaining
Finished Successfully after 4.4 seconds!                               

Analysis ended due to early stopping.

/////////////// INFORMATION ABOUT CONVENTIONAL ARCHETYPAL ANALYSIS \\\\\\\\\\\\\\\\
▣ The Conventional Archetypal Analysis was computed using 2 archetypes.
▣ The Conventional Archetypal Analysis was computed on 20 attributes.
▣ The Conventional Archetyp

To visualize the analysis results run:

In [None]:
! python main.py visualize --config-path configs/synthetic_config.yaml --save-folder synthetic_Q20_RB

  likert_counts = likert_counts.fillna(0)
Successfully created visualizations. The plots have been saved to figures/synthetic_Q20_RB


#### Corruption experiment
To run the data corruption experiment corresponding to the above analysis, simply add the --corrupt argument:

In [None]:
! python main.py analyse --config-path configs/synthetic_config.yaml --rb --M 20 --corrupt --save-folder synthetic_Q20_RB_corrupted --OSM-path data/synthetic/X_Q20_RB_corrupted_OSM.csv
! python main.py visualize --config-path configs/synthetic_config.yaml --save-folder synthetic_Q20_RB_corrupted --corrupt

#### No RB experiment

In [None]:
! python main.py analyse --config-path configs/synthetic_config.yaml --save-folder synthetic_Q20_NoRB --M 20 --OSM-path data/synthetic/X_Q20_NoRB_OSM.csv
! python main.py visualize --config-path configs/synthetic_config.yaml --save-folder synthetic_Q20_NoRB

To run the same experiment with corruption, execute the following:

In [None]:
! python main.py analyse --config-path configs/synthetic_config.yaml --M 20 --corrupt --save-folder synthetic_Q20_NoRB_corrupted --OSM-path data/synthetic/X_Q20_NoRB_OSM_corrupted.csv
! python main.py visualize --config-path configs/synthetic_config.yaml --corrupt --save-folder synthetic_Q20_NoRB_corrupted

For M = 100 questions, simply run the above code with --M 100 instead of 20.

In [None]:
# Q100 with RB
! python main.py analyse --config-path configs/synthetic_config.yaml --save-folder synthetic_Q100_RB --M 100 --rb
! python main.py visualize --config-path configs/synthetic_config.yaml --save-folder synthetic_Q100_RB

# Q100 with RB corrupted
! python main.py analyse --config-path configs/synthetic_config.yaml --rb --M 100 --corrupt --save-folder synthetic_Q100_RB_corrupted
! python main.py visualize --config-path configs/synthetic_config.yaml --save-folder synthetic_Q100_RB_corrupted --corrupt

# Q100 without RB
! python main.py analyse --config-path configs/synthetic_config.yaml --save-folder synthetic_Q100_NoRB --M 100
! python main.py visualize --config-path configs/synthetic_config.yaml --save-folder synthetic_Q100_NoRB

# Q100 without RB corrupted
! python main.py analyse --config-path configs/synthetic_config.yaml --M 100 --corrupt --save-folder synthetic_Q100_NoRB_corrupted
! python main.py visualize --config-path configs/synthetic_config.yaml --corrupt --save-folder synthetic_Q100_NoRB_corrupted
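
The synthetic runs above differ only in the number of questions and in whether response bias and corruption are enabled, so the command lines can also be generated programmatically. The sketch below uses only flags and folder names that appear in this demo; the helper `synthetic_commands` is hypothetical, and `--OSM-path` is deliberately left out because the file names used above do not follow a single pattern.

```python
from itertools import product


def synthetic_commands(M: int, rb: bool, corrupt: bool) -> list[list[str]]:
    """Build the analyse and visualize argument lists for one synthetic setting."""
    folder = f"synthetic_Q{M}_{'RB' if rb else 'NoRB'}" + ("_corrupted" if corrupt else "")
    common = ["--config-path", "configs/synthetic_config.yaml", "--save-folder", folder]
    analyse = ["python", "main.py", "analyse", *common, "--M", str(M)]
    visualize = ["python", "main.py", "visualize", *common]
    if rb:
        analyse.append("--rb")
    if corrupt:
        analyse.append("--corrupt")
        visualize.append("--corrupt")
    return [analyse, visualize]


# Print the commands for every setting instead of executing them.
for M, rb, corrupt in product([20, 100], [True, False], [False, True]):
    for cmd in synthetic_commands(M, rb, corrupt):
        print(" ".join(cmd))
```

Each argument list could be handed to `subprocess.run` from the repository root if you prefer to launch the runs from Python rather than from notebook cells.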

## 📊 ESS8 experiments 📊
These experiments use round 8 of the European Social Survey ([ESS8](https://ess.sikt.no/en/?tab=overview)). While the RBOAA and OAA models can run on the full data, we selected the GB subpopulation to allow for comparison with the TSAA model.

In [None]:
ess8_cfg = load_config('configs/ESS8_config.yaml')
print_data_params(ess8_cfg)
print_hyperparams(ess8_cfg)

In [None]:
!python main.py analyse --config-path configs/ESS8_config.yaml --save-folder ESS8_GB --X-path data/ESS8/ESS8_GB.csv --OSM-path data/ESS8/GB_OSM.csv

!python main.py visualize --config-path configs/ESS8_config.yaml --save-folder ESS8_GB

ESS8 GB corrupted analysis

In [None]:
!python main.py analyse --config-path configs/ESS8_config.yaml --save-folder ESS8_GB_corrupted --corrupt --X-path data/ESS8/ESS8_GB.csv --OSM-path data/ESS8/GB_data_OSM_corrupted.csv

!python main.py visualize --config-path configs/ESS8_config.yaml --save-folder ESS8_GB_corrupted --corrupt