### L2 regularisation experiment

This notebook runs an experiment to understand the effect of L2 regularisation on the predictions of matfact.  
The risk states are labelled with integers from 1 to 4: [1: Normal, 2: LowRisk, 3: HighRisk, 4: Cancer]  
Since the data is highly imbalanced towards the Normal and LowRisk states, labels 1 and 2 make up the majority of the datasets.  
Because L2 regularisation penalises large entries in both U and V, it might promote lower values (labels 1 and 2) in M.  
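
To illustrate the shrinkage effect behind this hypothesis, the sketch below uses plain ridge regression instead of matfact's actual objective, which is not reproduced here. The data, the `ridge_solution` helper, and the lambda values are hypothetical and chosen only for illustration.

```python
import numpy as np

def ridge_solution(A, b, lam):
    """Closed-form minimiser of ||A w - b||^2 + lam * ||w||^2 (hypothetical helper)."""
    n_features = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n_features), A.T @ b)

# Synthetic regression problem (assumption: unrelated to the matfact data)
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 5))
b = A @ np.array([1.0, 2.0, 3.0, 4.0, 2.5]) + rng.normal(scale=0.1, size=50)

# The norm of the solution shrinks as the L2 penalty grows
lambdas = [0, 3, 9, 18, 63, 189]
norms = [np.linalg.norm(ridge_solution(A, b, lam)) for lam in lambdas]
for lam, norm in zip(lambdas, norms):
    print(f"lambda={lam:>3}: ||w|| = {norm:.3f}")
```

If the factors of a matrix factorisation shrink in a similar way, the entries of their product shrink as well, which is the mechanism the experiment sets out to test.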

This experiment uses mlflow to log the matfact results for increasing regularisation parameters on U and V, using a synthetic dataset.  
Then, using the same dataset, the distribution of the labels is inverted so that the higher risks are represented by labels 1 and 2 and the lower risks by labels 3 and 4. This way the imbalance is also inverted, with a majority of labels 4 and 3.  
The matfact results on the inverted dataset are also logged, so that both sets of results can later be compared with visualisations.  
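
As a small worked example of the inversion, the sketch below mirrors the labels 1 to 4 on a toy matrix. It assumes that 0 is a placeholder that must stay untouched, in line with the `inv_X > 0` mask used in this notebook; the toy matrix itself is hypothetical.

```python
import numpy as np

labels = [1, 2, 3, 4]

# Toy observation matrix (hypothetical); 0 is assumed to be a placeholder
X = np.array([[1, 1, 0, 2],
              [0, 1, 3, 4]])

# Mirror each non-zero label within the label domain: 1 <-> 4, 2 <-> 3
X_inv = X.copy()
observed = X_inv > 0
X_inv[observed] = max(labels) + min(labels) - X_inv[observed]

print(X_inv)
```

The majority label 1 in `X` becomes the majority label 4 in `X_inv`, while the zero entries keep their positions.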


While the experiment runs, a confusion matrix is generated for each combination of regularisation parameters and saved as an image in the results directory. 
The remaining visualisations (Matthews correlation coefficient, accuracy, recall, precision) are generated at the end and saved in the same directory.
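
For reference, the sketch below shows one way these metrics can be derived from a confusion matrix using NumPy only. It is not the plotting code used by the experiment; the helpers `confusion_matrix` and `safe_divide` and the toy labels are hypothetical.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, labels):
    """Count (true, predicted) pairs; rows are true labels, columns predicted."""
    index = {label: i for i, label in enumerate(labels)}
    cm = np.zeros((len(labels), len(labels)), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[index[t], index[p]] += 1
    return cm

def safe_divide(num, den):
    """Element-wise num / den, returning 0 where den is 0."""
    num = np.asarray(num, dtype=float)
    den = np.asarray(den, dtype=float)
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

labels = [1, 2, 3, 4]
y_true = [1, 1, 1, 2, 2, 3, 4, 1]  # toy ground truth (hypothetical)
y_pred = [1, 1, 2, 2, 1, 3, 3, 1]  # toy predictions (hypothetical)

cm = confusion_matrix(y_true, y_pred, labels)
true_counts = cm.sum(axis=1)   # samples per true label
pred_counts = cm.sum(axis=0)   # samples per predicted label
correct = np.trace(cm)
total = cm.sum()

accuracy = correct / total
recall = safe_divide(np.diag(cm), true_counts)      # per-class recall
precision = safe_divide(np.diag(cm), pred_counts)   # per-class precision

# Multi-class Matthews correlation coefficient
mcc_num = correct * total - np.dot(pred_counts, true_counts)
mcc_den = np.sqrt((total**2 - np.dot(pred_counts, pred_counts))
                  * (total**2 - np.dot(true_counts, true_counts)))
mcc = mcc_num / mcc_den if mcc_den > 0 else 0.0

print(cm)
print(accuracy, recall, precision, mcc)
```

In this toy case label 4 is never predicted, so its precision is reported as 0 rather than raising a division warning.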

In [4]:
import mlflow
from mlflow import MlflowClient
from mlflow.entities import ViewType

from l2_regularization_experiments import run_l2_regularization_experiments

print(mlflow.get_tracking_uri())

file:///Users/martaq/Develop/decipher/matfact/experiments/mlruns


In [3]:
from matfact.data_generation.dataset import Dataset

In [4]:
import numpy as np
import logging
INVERTED = "inverted"

def log_dataset_info(dataset, inverted=False):
    # Logs summary statistics of a dataset; messages for the inverted dataset are prefixed
    prefix = INVERTED if inverted else ""
    # Built-in sum over a 2D array adds up the rows, giving the count of zeros per column of X
    logging.info(f"{prefix}dataset X 0: {sum(dataset.X==0)}")
    logging.info(
        f"{prefix}dataset X histogram: {np.histogram(dataset.X, bins=[0,1,2,3,4,5])}"
    )
    logging.info(
        f"{prefix}dataset M histogram: {np.histogram(dataset.M, bins=[1,2,3,4])}"
    )
    logging.info(f"{prefix}dataset metadata: {dataset.metadata}")

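def invert_domain(X, labels):
    # Mirrors values within the label domain, e.g. for labels 1..4: 1 <-> 4, 2 <-> 3.
    # Using the label bounds instead of X.max()/X.min() keeps the mapping correct
    # even when X does not contain the extreme labels.
    return max(labels) + min(labels) - X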


def invert_dataset(dataset, labels):
    # Inverts dataset label distributions
    log_dataset_info(dataset)
    inv_M = invert_domain(dataset.M.copy(), labels)
    inv_X = dataset.X.copy()
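    # Only non-zero entries are inverted; entries equal to 0 are left untouched
    inv_X[inv_X > 0] = invert_domain(inv_X[inv_X > 0], labels)
    inv_metadata = dataset.metadata.copy()
    # Same probabilities as the original dataset, with the last four entries reversed
    inv_metadata["observation_probabilities"] = [0.01, 0.04, 0.12, 0.08, 0.03]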
    inv_dataset = Dataset(inv_X, inv_M, inv_metadata)
    log_dataset_info(inv_dataset, inverted=True)
    return inv_dataset
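
The `np.histogram` calls above pass explicit bin edges. NumPy treats every bin as half-open except the last one, which also includes its right edge, so `bins=[1,2,3,4]` yields three counts and values equal to 4 land in the last bin together with values in [3, 4). The sketch below illustrates this on hypothetical toy arrays.

```python
import numpy as np

# Continuous toy values (hypothetical) binned with edges 1, 2, 3, 4
values = np.array([1.0, 1.5, 2.0, 2.9, 3.0, 3.7, 4.0])
counts, edges = np.histogram(values, bins=[1, 2, 3, 4])
print(counts)  # three bins: [1, 2), [2, 3), [3, 4]

# Integer toy labels (hypothetical) binned with edges 0..5: one bin per value 0..4
label_values = np.array([0, 1, 1, 2, 4])
label_counts, _ = np.histogram(label_values, bins=[0, 1, 2, 3, 4, 5])
print(label_counts)
```

This is why the logged histogram of M has three counts while the histogram of X has five.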

In [5]:
normal_dataset = Dataset.generate(N=10000, T=100, rank=5, sparsity_level=100)
inv_dataset = invert_dataset(normal_dataset, [1,2,3,4])

In [6]:
# Run experiments with increasing parameters for the U and V L2 regularisations.
lambda_values = [0, 3, 9, 18, 21, 63, 126, 189]
run_l2_regularization_experiments(lambda_values)

INFO:root:dataset X 0: [9997    0    0    0    1    4    7   13   27   41   59   90  121  151
  184  240  291  352  418  497  581  675  790  913 1046 1181 1335 1485
 1649 1833 2017 2196 2405 2605 2802 3017 3239 3458 3686 3913 4141 4374
 4593 4851 5116 5355 5589 5837 6110 6311 6534 6738 6960 7150 7366 7548
 7738 7932 8086 8233 8394 8534 8669 8791 8928 9033 9146 9251 9349 9430
 9494 9553 9603 9648 9692 9730 9768 9811 9841 9868 9895 9916 9937 9951
 9963 9971 9976 9982 9985 9987 9992 9993 9996 9997 9997 9997 9997 9997
 9997 9997]
INFO:root:dataset X histogram: (array([565937, 277243, 153487,   2985,     48]), array([0, 1, 2, 3, 4, 5]))
INFO:root:dataset M histogram: (array([915438,  81810,   2452]), array([1, 2, 3, 4]))
INFO:root:dataset metadata: {'rank': 5, 'sparsity_level': 100, 'N': 10000, 'T': 100, 'generation_method': 'DGD', 'number_of_states': 4, 'observation_probabilities': [0.01, 0.03, 0.08, 0.12, 0.04]}
INFO:root:inverteddataset X 0: [9997    0    0    0    1    4    7   13   27 

In [None]:
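# List every existing experiment (active and deleted) and delete it.
# Caution: the loop deletes each listed experiment; remove the delete call to only inspect them.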
client = MlflowClient()
for e in client.search_experiments(ViewType.ALL):
    print(e)
    client.delete_experiment(e.experiment_id)

# Clean experiments?
# client = MlflowClient()
# client.delete_experiment("658405861631685322")

<Experiment: artifact_location='file:///Users/martaq/Develop/decipher/matfact/experiments/mlruns/473358499169108146', creation_time=1671194543685, experiment_id='473358499169108146', last_update_time=1671194557029, lifecycle_stage='deleted', name='exp_normal_221216_134223', tags={}>
<Experiment: artifact_location='file:///Users/martaq/Develop/decipher/matfact/experiments/mlruns/308648439440736948', creation_time=1671187168160, experiment_id='308648439440736948', last_update_time=1671194557030, lifecycle_stage='deleted', name='exp_inverted_221216_113928', tags={}>
<Experiment: artifact_location='file:///Users/martaq/Develop/decipher/matfact/experiments/mlruns/202780764438619067', creation_time=1671186794638, experiment_id='202780764438619067', last_update_time=1671194557030, lifecycle_stage='deleted', name='exp_normal_221216_113314', tags={}>
<Experiment: artifact_location='file:///Users/martaq/Develop/decipher/matfact/experiments/mlruns/291375103599013490', creation_time=1671019182924,