### Deprecated
This file is no longer used. The project members decided to continue the analysis with KNN models only. To extract data and information from the datasets, we now use 'knn_exploration.ipynb' or 'knn_exploration.py', which provide better logs, more information about KNN, and simpler code.

# Data Exploration
<div style="text-align: justify">
In this phase, we collect several metrics (precision, F1 score, recall, accuracy, and one-vs-rest ROC AUC) for different models: decision trees, KNNs, and SVMs. We use GridSearchCV to explore combinations of hyperparameters. The goal is to produce a CSV containing the metrics for each model and hyperparameter set, so that we can compare how performance varies across tasks taken from the OpenML repository.

To streamline the process, we partition the OpenML-CC18 curated classification benchmark into three groups: multiclass datasets, balanced binary datasets, and imbalanced binary datasets. A threshold of 0.3 decides whether a binary dataset is balanced: if the minority class makes up less than 30% of the targets, the dataset is classified as imbalanced.
</div>
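
The balancing rule can be sketched on a plain list of labels. This is a minimal illustration rather than the notebook's implementation: the helper name `classify_targets` is hypothetical, and the label lists are made up.

```python
from collections import Counter


def classify_targets(targets, threshold=0.3):
    """Label a target vector as multiclass, balanced binary, or imbalanced binary.

    Hypothetical helper, for illustration only.
    """
    counts = Counter(targets)
    # Anything that is not binary is treated as multiclass
    if len(counts) != 2:
        return "multiclass"
    # Share of samples that belong to the less frequent class
    minority_fraction = min(counts.values()) / len(targets)
    return "imbalanced binary" if minority_fraction < threshold else "balanced binary"


print(classify_targets([0] * 80 + [1] * 20))  # minority share 0.20
print(classify_targets([0] * 60 + [1] * 40))  # minority share 0.40
print(classify_targets([0, 1, 2] * 10))       # three classes
```

With a strict comparison, a dataset whose minority class is exactly 30% of the targets still counts as balanced.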

In [2]:
import openml
import os
import pandas as pd
from pandas import DataFrame
import numpy as np
import warnings


# logging.basicConfig(filename='benchmark.log', level=logging.INFO)
warnings.filterwarnings("ignore")

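The function separate_dataset_characteristics() splits the datasets of OpenML-CC18 into
- Imbalanced binary tasks (called "disbalanced" in the code)
- Balanced binary tasks
- Multiclass tasks
  
Furthermore, this function filters datasets by their number of rows: only the first 50 tasks of the suite are considered, and any dataset with more than 5000 instances is discarded.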

In [3]:
def separate_dataset_characteristics(benchmark: str ="OpenML-CC18", disbalance_threshold: float = 0.3) -> dict:
    benchmark_suite = openml.study.get_suite(benchmark)
    subset_benchmark_suite = benchmark_suite.tasks[0:50]
    disbalanced_binary_tasks = []
    balanced_binary_tasks = []
    multiclass_tasks = []

    for task_id in subset_benchmark_suite:
        task = openml.tasks.get_task(task_id)
        _, targets = task.get_X_and_y()
        num_classes = len(np.unique(targets))
        
        num_instances = len(targets)
        if num_instances > 5000:
            print(f"Dataset {task_id} too big. Discarded")
            continue

        if num_classes == 2:  # Binary classification task
            # Fraction of samples that belong to the minority class
            minority_fraction = pd.Series(targets).value_counts(normalize=True).min()
            if minority_fraction < disbalance_threshold:
                disbalanced_binary_tasks.append(task_id)
            else:
                balanced_binary_tasks.append(task_id)
            continue

        multiclass_tasks.append(task_id)

    return {
        "disbalanced_binary_tasks": disbalanced_binary_tasks,
        "balanced_binary_tasks": balanced_binary_tasks,
        "multiclass_tasks": multiclass_tasks
    }

In [None]:
tasks = separate_dataset_characteristics()
print("Disbalanced binary tasks: ", tasks['disbalanced_binary_tasks'])
print("Balanced binary tasks: ", tasks['balanced_binary_tasks'])
print("Multiclass tasks: ", tasks['multiclass_tasks'])

In [5]:
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

<div style="text-align: justify">
The function filter_columns() selects the key metrics needed to generate the CSV. It is optional and exists only to simplify the analysis by keeping the relevant information: it drops the per-fold cross-validation metrics and keeps only the mean of each metric, together with the hyperparameter columns.
</div>

In [None]:
def filter_columns(results: DataFrame) -> DataFrame:
    keep_columns = ['Model', 'Dataset', 'mean_test_accuracy', 'mean_test_precision', 'mean_test_recall',
                    'mean_test_f1', 'mean_test_roc_auc_ovr', 'Data_type']
    keep_columns += [col for col in results.columns if col.startswith('param_')]
    results = results[keep_columns]
    return results
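
The effect of this filtering can be seen on a toy table. The sketch below uses a hand-made stand-in for `cv_results_` with invented values and only one metric; unlike filter_columns(), it also skips wanted columns that are absent, which is an assumption added for the illustration.

```python
import pandas as pd

# Toy stand-in for grid_search.cv_results_ (values are made up for illustration)
toy_results = pd.DataFrame({
    "Model": ["knn", "knn"],
    "Dataset": [3, 3],
    "Data_type": ["balanced_binary_tasks"] * 2,
    "param_kneighborsclassifier__n_neighbors": [3, 5],
    "split0_test_accuracy": [0.90, 0.92],
    "split1_test_accuracy": [0.88, 0.94],
    "mean_test_accuracy": [0.89, 0.93],
    "std_test_accuracy": [0.01, 0.01],
})

# Keep identifying columns, the mean metrics that are present, and all hyperparameter columns
wanted = ["Model", "Dataset", "mean_test_accuracy", "Data_type"]
keep = [col for col in wanted if col in toy_results.columns]
keep += [col for col in toy_results.columns if col.startswith("param_")]

filtered = toy_results[keep]
print(list(filtered.columns))
```

The per-fold (`split*_`) and `std_` columns disappear, leaving one row per hyperparameter combination.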

<div style="text-align: justify">
The function run_benchmark() runs a grid search over the hyperparameters of one model on every task in a given group, using 10-fold cross-validation. It then adds the task ID, the dataset type (obtained beforehand with separate_dataset_characteristics()), and the model name to the resulting dataframe.
</div>

In [6]:
def run_benchmark(model: any, model_name: str, params: dict = None, metrics: list = None, tasks: list = None, tasks_description: str = None) -> DataFrame:
    print(f"\nEvaluating metric for {model_name} model")
    results_list = []

    if not tasks:
        return pd.DataFrame()  # Nothing to evaluate: return an empty dataframe instead of None

    for task_id in tasks:
        print(f"Started task {task_id}")
        task = openml.tasks.get_task(task_id)
        features, targets = task.get_X_and_y()
        
        grid_search = GridSearchCV(model, params, cv=10, scoring=metrics, refit=False, n_jobs=-1)
        grid_search.fit(features, targets)

        results = pd.DataFrame(grid_search.cv_results_)
        results['Dataset'] = task_id
        results['Data_type'] = tasks_description
        results['Model'] = model_name
        results_list.append(results)

    all_results = pd.concat(results_list, ignore_index=True)
    print("Finalized evaluation\n")
    return filter_columns(all_results)
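
To see what a multi-metric grid search returns, here is a minimal sketch on synthetic data. The dataset from `make_classification`, the two-value grid, and the 3 folds are assumptions chosen to keep the run short; they are not the settings used in this notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Small synthetic binary problem (an assumption, used instead of an OpenML task)
X, y = make_classification(n_samples=120, n_features=5, random_state=0)

search = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": [3, 5]},
    cv=3,
    scoring=["accuracy", "f1"],
    refit=False,  # with several metrics, refit cannot stay at its default of True
)
search.fit(X, y)

# One row per hyperparameter combination, one mean_test_<metric> column per metric
toy_cv_results = pd.DataFrame(search.cv_results_)
print(toy_cv_results[["param_n_neighbors", "mean_test_accuracy", "mean_test_f1"]])
```

Each metric name passed in `scoring` yields its own `mean_test_<metric>` column, which is what filter_columns() relies on.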

In [7]:
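def concat_list_of_dataframes(list_of_dataframes: list) -> DataFrame:
    # Skip missing results and return an empty dataframe when nothing is left
    dataframes = [df for df in list_of_dataframes if df is not None]
    if not dataframes:
        return pd.DataFrame()
    return pd.concat(dataframes, ignore_index=True)

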
def create_csv(dataframe, name) -> None:
    path = os.path.join(os.getcwd(), '../csv_files/')
    if not dataframe.empty:
        if not os.path.exists(path):
            os.makedirs(path)
        full_path = os.path.join(path, name)
        dataframe.to_csv(full_path, index=False)
        print(f"CSV file '{full_path}' saved successfully.")
    else:
        print("No dataframe provided.")


<div style="text-align: justify">
Here, we instantiate the models to benchmark, then choose the evaluation metrics and the hyperparameter values of each model that GridSearchCV will combine.
</div>

In [8]:
# Decision trees are insensitive to feature scaling, so their pipeline has no StandardScaler
dt = make_pipeline(SimpleImputer(strategy='constant'), DecisionTreeClassifier())
knn = make_pipeline(SimpleImputer(strategy='constant'), StandardScaler(), KNeighborsClassifier())
# probability=True enables predict_proba, which the ROC AUC metric needs
svm = make_pipeline(SimpleImputer(strategy='constant'), StandardScaler(), SVC(probability=True))
metrics = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc_ovr']
models = [dt, knn, svm]
model_names = ['dt', 'knn', 'svm']

params = {
    'dt': {
        'decisiontreeclassifier__criterion': ['gini', 'entropy'],  
        'decisiontreeclassifier__max_depth': [5, 7, 9],  
        'decisiontreeclassifier__min_samples_split': [3, 4, 5],  
        'decisiontreeclassifier__min_samples_leaf': [2, 3, 4]
    },
    'knn': {
        'kneighborsclassifier__n_neighbors': [3, 5, 7, 9, 11],  
        'kneighborsclassifier__weights': ['uniform', 'distance'],  
    },
    'svm': {
        'svc__kernel': ['linear', 'poly', 'rbf', 'sigmoid'],  
        'svc__gamma': ['scale', 'auto']  
    }
}
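
The key prefixes in the grid (for example `kneighborsclassifier__`) come from the step names that `make_pipeline` assigns automatically: the lowercased class name of each step. A short sketch to check this:

```python
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipeline = make_pipeline(SimpleImputer(strategy="constant"), StandardScaler(), KNeighborsClassifier())

# Step names generated by make_pipeline
step_names = list(pipeline.named_steps)
print(step_names)

# Every tunable parameter is addressed as <step name>__<parameter name>
print("kneighborsclassifier__n_neighbors" in pipeline.get_params())
```

A misspelled prefix makes GridSearchCV fail with an invalid-parameter error, so listing `pipeline.get_params()` is a quick way to validate a grid before a long run.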

<div style="text-align: justify">
Finally, we run the benchmark, which generates three CSV files, one per model. Each CSV contains the metrics, task IDs, and hyperparameters for every group of datasets defined at the beginning: balanced binary, imbalanced binary, and multiclass. Note that GridSearchCV runs with n_jobs=-1, so it uses all available cores in parallel. Even so, computing all the metrics for SVMs on a large number of datasets takes considerable time and resources. For this reason, we analyze only a portion of the datasets to observe how SVMs behave.
</div>
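
A rough count of model fits per dataset gives a sense of the cost. The sketch below multiplies the number of values of each hyperparameter in the grids defined earlier by the 10 cross-validation folds. It counts fits only and says nothing about how long a single fit takes, which is where SVMs are expensive.

```python
from math import prod

# Number of values per hyperparameter, copied from the params dictionary
grid_sizes = {
    "dt": [2, 3, 3, 3],  # criterion, max_depth, min_samples_split, min_samples_leaf
    "knn": [5, 2],       # n_neighbors, weights
    "svm": [4, 2],       # kernel, gamma
}
cv_folds = 10

# Fits per dataset = number of hyperparameter combinations x number of folds
fits_per_dataset = {name: prod(sizes) * cv_folds for name, sizes in grid_sizes.items()}
print(fits_per_dataset)  # {'dt': 540, 'knn': 100, 'svm': 80}
```

The SVM grid needs the fewest fits, yet each fit is costlier, in part because `SVC(probability=True)` runs an additional internal cross-validation to calibrate its probability estimates.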

In [None]:
try:
    for model, model_name in zip(models, model_names):
        print(f"\nModel Name: {model_name}")
        results = []

        # Run the grid search on each group of tasks and tag the results with the group name
        for tasks_description in ('disbalanced_binary_tasks', 'balanced_binary_tasks', 'multiclass_tasks'):
            group_results = run_benchmark(model, model_name, params=params[model_name], metrics=metrics,
                                          tasks=tasks[tasks_description], tasks_description=tasks_description)
            if group_results is not None:  # Skip groups that produced no result
                results.append(group_results)

        concatenated_df = concat_list_of_dataframes(results) if results else None
        if concatenated_df is not None and not concatenated_df.empty:
            create_csv(concatenated_df, f"metrics_{model_name}.csv")
            print(f"Created csv metrics_{model_name}.csv")
        else:
            print(f"No dataframes to concatenate for {model_name}.")
except Exception as e:
    print(f"An error occurred: {e}")