# Example of how this framework works
This is a simple example that shows how to use the experimentation code.
The experiments used in the nCOBRAS paper are in the experiments package.

## Set-up
First, fill in `HOMEDIR` in `config.py`.
If you want to compare with COSC and MPCK-means, you also have to fill in `WEKA_PATH` and `COSC_PATH`.
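
As a minimal sketch of what the filled-in settings could look like, the snippet below defines the three variables with placeholder values and reports which of them do not point to an existing directory. The paths and the `missing_directories` helper are hypothetical and not part of the framework, and the real `config.py` may store these values differently.

```python
from pathlib import Path

# Hypothetical placeholder values: replace them with the directories on your machine.
HOMEDIR = "/path/to/experiment_home"
WEKA_PATH = "/path/to/weka"  # only needed for MPCK-means
COSC_PATH = "/path/to/cosc"  # only needed for COSC


def missing_directories(paths):
    """Return the names of the settings whose directory does not exist."""
    return [name for name, value in paths.items() if not Path(value).is_dir()]


settings = {"HOMEDIR": HOMEDIR, "WEKA_PATH": WEKA_PATH, "COSC_PATH": COSC_PATH}
for name in missing_directories(settings):
    print(f"{name} does not point to an existing directory: {settings[name]}")
```

Running a check like this before starting long experiments catches a wrong path early.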

### Additional requirements for MPCK-means
To run MPCK-means you need WEKA.
In `config.py`, fill in `WEKA_PATH` (the root directory where WEKA is installed).
You also need a compatible version of Java installed.

### Additional requirements for COSC
In order to compare with COSC you need to download COSC from:
In `config.py`, fill in `COSC_PATH` (the root directory where COSC is installed).
You also need MATLAB and the MATLAB Engine API for Python installed.

## Imports

In [1]:
import os
import sys
module_path_dont_know = os.path.abspath(os.path.join('../COBRAS_dont_know'))
module_path_testing = os.path.abspath(os.path.join('../COBRAS_testing'))
module_path_noise_robust = os.path.abspath(os.path.join('../noise_robust_cobras'))
if module_path_dont_know not in sys.path:
    sys.path.append(module_path_dont_know)
    print("module path of dont_know added")

if module_path_testing not in sys.path:
    sys.path.append(module_path_testing)
    print("module path of testing added")
    
if module_path_noise_robust not in sys.path:
    sys.path.append(module_path_noise_robust)
    print("module path of noise robust added")
    
import copy
from pathlib import Path
from distributed import Client, LocalCluster
from datasets import Dataset
from evaluate_clusterings.calculate_aligned_rank import calculate_and_write_aligned_rank
from evaluate_clusterings.calculate_aris import calculate_n_times_n_fold_aris_for_testnames
from evaluate_clusterings.calculate_average_aris import calculate_average_aris
from generate_clusterings.clustering_task import ClusteringTask, make_n_run_10_fold_cross_validation
from noise_robust_cobras.querier.noisy_labelquerier import ProbabilisticNoisyQuerier
from noise_robust_cobras.cobras import COBRAS
from config import HOMEDIR, FIGURE_DIR, FOLD_RESULT_DIR
from noise_robust_cobras.strategies.splitlevel_estimation import StandardSplitLevelEstimationStrategy
from noise_robust_cobras.strategies.superinstance_selection import LeastInstancesSelectionHeuristic
from present_results.plot_aligned_rank import plot_rank_comparison_file
from present_results.plot_aris import plot_average_ARI_per_dataset, plot_overall_average_ARI
from run_with_dask.run_with_dask import execute_list_of_clustering_tasks
from before_clustering.generate_folds import generate_folds_for_dataset
from run_locally.run_tests import run_clustering_tasks_locally
TEST_PATH = Path(HOMEDIR)/"example_notebook_results"

module path of dont_know added
module path of testing added
module path of noise robust added


## Generate folds
Before running any experiments, you have to generate the test/training sets for all datasets.
You only have to do this once; afterwards the folds are stored on disk.

In [2]:
if not Path(FOLD_RESULT_DIR).exists():
    generate_folds_for_dataset()

## Set-up experiments
### Using ClusteringTask directly
The system works by initialising a `ClusteringTask` instance for each clustering operation that you want to execute.
To initialise a clustering task, you pass everything needed to execute a single clustering experiment:
- clusterer: a clustering algorithm (e.g. COBRAS, COSCMatlab, MPCKMeansJava, or your own clusterer)
- dataset_name: the name of a dataset that is readable by the `Dataset` class in `dataset.py`
- training indices: the training indices for this experiment
- extra_result_extractor: if you want custom results to be extracted from the clusterer after clustering, you can pass a result extractor (a function with signature f(clusterer): Dict)
- querier: the querier that is used for this clustering experiment
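
To make the `extra_result_extractor` signature concrete, here is a minimal sketch of such a function. It only assumes the f(clusterer): Dict shape described above. The attribute `nb_of_splits` and the `DummyClusterer` class are hypothetical and exist only so that the example runs without the framework.

```python
from typing import Any, Dict


def example_result_extractor(clusterer: Any) -> Dict[str, Any]:
    """Collect a few extra values from a clusterer after clustering."""
    return {
        "clusterer_class": type(clusterer).__name__,
        # `nb_of_splits` is a hypothetical attribute used only for illustration
        "nb_of_splits": getattr(clusterer, "nb_of_splits", None),
    }


class DummyClusterer:
    """Stand-in clusterer so the sketch runs without the framework."""

    nb_of_splits = 3


print(example_result_extractor(DummyClusterer()))
```

Using `getattr` with a default keeps the extractor usable with clusterers that do not expose the attribute.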

In [3]:
data = Dataset('iris')
clusterer = COBRAS()

# labels are filled in when the clustering task is run
querier = ProbabilisticNoisyQuerier(None, None, 0.10, 100, random_seed=123)

task = ClusteringTask(clusterer, data.name, None, None, querier, TEST_PATH / 'example_COBRAS_run')
print("done")

done
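
The notebook does not document the arguments of `ProbabilisticNoisyQuerier`. Assuming that `0.10` is the probability that an answer is wrong and `100` is the maximum number of queries, the idea of a noisy pairwise querier can be sketched as follows. `ToyNoisyQuerier` and its method names are hypothetical and are not the framework's implementation.

```python
import random


class ToyNoisyQuerier:
    """Hypothetical stand-in, not the framework's ProbabilisticNoisyQuerier."""

    def __init__(self, labels, noise_probability, max_queries, random_seed=None):
        self.labels = labels
        self.noise_probability = noise_probability
        self.max_queries = max_queries
        self.queries_asked = 0
        self.rng = random.Random(random_seed)

    def query_limit_reached(self):
        return self.queries_asked >= self.max_queries

    def query(self, i, j):
        """Return True for must-link and False for cannot-link, possibly flipped."""
        if self.query_limit_reached():
            raise RuntimeError("query budget exhausted")
        self.queries_asked += 1
        truth = self.labels[i] == self.labels[j]
        # with probability noise_probability the answer is flipped
        if self.rng.random() < self.noise_probability:
            return not truth
        return truth


toy_querier = ToyNoisyQuerier([0, 0, 1, 1], 0.10, 100, random_seed=123)
print(toy_querier.query(0, 1), toy_querier.query(0, 2))
```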


A `ClusteringTask` can be run individually, as shown in the next cell.
Usually you will not execute clustering tasks like this; instead, you execute a collection of clustering tasks through Dask.
This way the tasks can easily be executed in parallel over the cores of your local machine or over different hosts through SSH.
This is illustrated later in this notebook.

**Note:** `run` takes the dataset as an argument so that a clustering task can easily be run through Dask.

In [4]:
task.run(data)

### Constructing n times 10 fold cross validation experiments
If you want to execute n times 10-fold cross-validation tests, you don't have to construct all clustering tasks individually.
The `make_n_run_10_fold_cross_validation` function does this for you.
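
To illustrate what n times 10-fold cross-validation means in terms of index sets, the sketch below builds the training and test indices for every run and fold using only the standard library. It is an illustration of the general technique, not the framework's fold generation (which may, for example, stratify the folds), and `n_times_k_fold_indices` is a hypothetical helper.

```python
import random


def n_times_k_fold_indices(nb_instances, n_runs, k=10, seed=0):
    """Return a list of (run, fold, training_indices, test_indices) tuples."""
    rng = random.Random(seed)
    splits = []
    for run in range(n_runs):
        indices = list(range(nb_instances))
        rng.shuffle(indices)
        # every k-th shuffled index goes to the same fold
        folds = [indices[f::k] for f in range(k)]
        for fold_nb, test_indices in enumerate(folds):
            test_set = set(test_indices)
            training_indices = [i for i in indices if i not in test_set]
            splits.append((run, fold_nb, sorted(training_indices), sorted(test_indices)))
    return splits


splits = n_times_k_fold_indices(nb_instances=150, n_runs=3)
print(len(splits))  # 3 runs x 10 folds = 30 splits
```

Each split corresponds to one clustering experiment per clusterer and dataset, which is why the number of tasks grows quickly with n.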


In [5]:
robust_clusterer = COBRAS()
querier = ProbabilisticNoisyQuerier(None, None, 0.10, 50)
test_name_robust = "example_ncobras"
# only iris is used here; add 'ecoli' and 'glass' to the list to use 3 datasets
all_dataset_names = ['iris'] #, 'ecoli', 'glass'
clustering_tasks = make_n_run_10_fold_cross_validation(test_name_robust, robust_clusterer, querier, all_dataset_names, 3)

For comparison, let's also run COBRAS without the noise correction mechanism:

In [6]:
clusterer = COBRAS(correct_noise=False)
test_name = 'example_cobras'
clustering_tasks.extend(make_n_run_10_fold_cross_validation(test_name, clusterer, querier, all_dataset_names, 3))

To execute these tasks you can use Dask
(for documentation on how to set up Dask over different hosts, see https://docs.dask.org/en/latest/setup.html).
The cell below has the Dask variant commented out and runs the tasks locally instead.

In [7]:
# Dask variant (commented out): a local dask client
# with Client(LocalCluster(n_workers=2)) as client:
#     # this executes all clustering tasks using the given client
#     execute_list_of_clustering_tasks(client, clustering_tasks)

# run the clustering tasks locally
run_clustering_tasks_locally(clustering_tasks)
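
The reason a list of independent tasks parallelises well can be sketched with the standard library. The example below deliberately uses `concurrent.futures` as a stand-in for Dask so that it runs without extra dependencies. `ToyTask` and `run_tasks_in_parallel` are hypothetical and do not reflect how `execute_list_of_clustering_tasks` is implemented.

```python
from concurrent.futures import ThreadPoolExecutor


class ToyTask:
    """Hypothetical stand-in for a clustering task with a run method."""

    def __init__(self, name):
        self.name = name

    def run(self):
        return f"{self.name} finished"


def run_tasks_in_parallel(tasks, max_workers=2):
    """Run every task on a pool of workers and return the results in task order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda task: task.run(), tasks))


results = run_tasks_in_parallel([ToyTask(f"task_{i}") for i in range(4)])
print(results)
```

Because the tasks do not share state, they can be handed to any pool of workers, whether that is a thread pool as here or a Dask cluster.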

## Gathering results
To gather results and make the plots you can use the following code:

In [8]:
test_names = [test_name_robust, test_name]
comparison_name = 'example_comparison'
nb_of_cores = 4
query_budget = 50
# calculate aris of each clustering result
calculate_n_times_n_fold_aris_for_testnames(test_names, nb_cores=nb_of_cores)
# calculate average ari over all folds per dataset
calculate_average_aris(test_names, query_budget)

# calculates and stores the average aligned rank for each algorithm and each dataset
calculate_and_write_aligned_rank(test_names, comparison_name)

# plot the average ari for each dataset
plot_average_ARI_per_dataset(comparison_name, test_names, test_names)
# plot the overall average ari (over all datasets)
plot_overall_average_ARI(comparison_name, test_names, test_names)
# plot the average aligned ranks
plot_rank_comparison_file(comparison_name, test_names, test_names)

print(f"all resulting plots are in {Path(FIGURE_DIR)/comparison_name}")



  0%|          | 0/2 [00:00<?, ?it/s]

Calculating ARIs for n-times n-fold:  ['example_ncobras', 'example_cobras']
already calculated
Calculating average ARIs


100%|██████████| 2/2 [00:00<00:00,  4.79it/s]


calculating average rank
all resulting plots are in \Users\nicol\Documents\KUL 2020-2021\thesis\code\results\results\figures\example_comparison
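
For reference, the adjusted Rand index (ARI) that the evaluation step reports compares a clustering with the ground-truth labels. The sketch below is a plain-Python version of the standard ARI formula followed by an average over a few folds. The function name and the example labels are illustrative, and the framework's own calculation may differ in its details.

```python
from collections import Counter
from math import comb
from statistics import mean


def adjusted_rand_index(labels_true, labels_pred):
    """Compute the ARI between two labelings of the same instances."""
    n = len(labels_true)
    if n < 2:
        return 1.0
    # number of instance pairs that are together in both labelings
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)
    maximum = (sum_true + sum_pred) / 2
    if maximum == expected:
        return 1.0
    return (sum_cells - expected) / (maximum - expected)


# made-up labels for three folds, only to show the averaging step
truth = [0, 0, 1, 1]
fold_predictions = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 1, 0, 1]]
average_ari = mean(adjusted_rand_index(truth, pred) for pred in fold_predictions)
print(average_ari)
```

The ARI is 1 for identical clusterings regardless of how the clusters are numbered, and close to 0 (possibly negative) for clusterings that agree no better than chance.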
