Working with the datasets subpackage
------------------------------------
The ``datasets`` subpackage provides data loading and management utilities for machine learning workflows.
This tutorial shows how to use it to load, inspect, transform, and save your data.


## Using the DatasetsManager Class
The `DatasetsManager` class in the `MED3pa.datasets` submodule manages the datasets needed for model training and evaluation. The steps below show how to set it up and use it.

### Step 1: Importing the DatasetsManager
First, import the `DatasetsManager` from the `MED3pa.datasets` submodule:


In [1]:
import sys
import os

# Add the parent directory to the import path so that a local copy of MED3pa can be imported
sys.path.insert(0, os.path.abspath(os.path.join(os.getcwd(), '..')))

from MED3pa.datasets import DatasetsManager


### Step 2: Creating an Instance of DatasetsManager
Create an instance of `DatasetsManager`. This instance will manage all operations related to datasets:


In [2]:
manager = DatasetsManager()


### Step 3: Loading Datasets
With the `DatasetsManager`, you can load the different datasets used by your base model: training, validation, reference, and testing. You do not need to load all of them at once. For each one, provide the path to the dataset and the name of the target column:

#### Loading from File


In [3]:
manager.set_from_file(dataset_type="training", file='./data/train_data.csv', target_column_name='Outcome')


#### Loading from NumPy Arrays
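You can also load a dataset from NumPy arrays. Specify the features and the true labels, plus the column labels as a list (excluding the target column) if they are not already set. In the example below, the column labels were already set when the training file was loaded, so they are omitted.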


In [4]:
import numpy as np
import pandas as pd

df = pd.read_csv('./data/val_data.csv')

# Extract labels and features
X_val = df.drop(columns='Outcome').values
y_val = df['Outcome'].values

# Example of setting data from numpy arrays
manager.set_from_data(dataset_type="validation", observations=X_val, true_labels=y_val)
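
When the column labels are not yet set, they have to be derived from the source table without the target column. The following is a minimal, self-contained sketch of that split. It assumes a small in-memory CSV with made-up rows in place of a real file, and the names `header`, `column_labels`, `X`, and `y` are illustrative only, not part of `MED3pa`.

```python
import csv
import io

import numpy as np

# Hypothetical in-memory CSV standing in for a real file on disk.
raw = io.StringIO(
    "Glucose,BMI,Outcome\n"
    "148,33.6,1\n"
    "85,26.6,0\n"
    "183,23.3,1\n"
)
reader = csv.reader(raw)
header = next(reader)
rows = np.array([[float(value) for value in row] for row in reader])

target = "Outcome"
target_idx = header.index(target)

# Column labels are the feature names only, so the target column is excluded.
column_labels = [name for name in header if name != target]

# Features: every column except the target. Labels: the target column.
X = np.delete(rows, target_idx, axis=1)
y = rows[:, target_idx].astype(int)

print(column_labels, X.shape, y.shape)
```

The resulting `X`, `y`, and `column_labels` have the shapes expected for the features, true labels, and column labels described above.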


### Step 4: Ensuring Feature Consistency
Upon loading the first dataset, the `DatasetsManager` automatically extracts and stores the names of features. You can retrieve the list of these features using:


In [5]:
features = manager.get_column_labels()
print("Extracted features :", features)

Extracted features : ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
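
The kind of check this step suggests can be sketched in plain Python. This is only an illustration under stated assumptions: `check_column_labels` is a hypothetical helper, not a `MED3pa` function, and the library's own validation may behave differently.

```python
def check_column_labels(stored, incoming):
    """Raise ValueError if the incoming feature names or their order differ."""
    if list(stored) != list(incoming):
        missing = [name for name in stored if name not in incoming]
        extra = [name for name in incoming if name not in stored]
        raise ValueError(
            f"Feature mismatch: missing={missing}, unexpected={extra}"
        )
    return True


stored_labels = ["Glucose", "BMI", "Age"]

# Same names in the same order: the check passes.
matching = check_column_labels(stored_labels, ["Glucose", "BMI", "Age"])

# A dataset that lacks one feature is rejected.
try:
    check_column_labels(stored_labels, ["Glucose", "Age"])
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```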


### Step 5: Retrieving Data
Retrieve the loaded data in different formats as needed.

#### As NumPy Arrays


In [6]:
observations, labels = manager.get_dataset_by_type(dataset_type="training")

# Print the shape of features and labels to verify they have been loaded
print(f"Observations shape: {observations.shape}")
print(f"Labels shape: {labels.shape}")

print("\nFirst 5 rows of features:")
print(observations[:5])

print("\nFirst 5 labels:")
print(labels[:5])

Observations shape: (537, 8)
Labels shape: (537,)

First 5 rows of features:
[[-0.8362943  -0.80005088 -0.53576428 -0.15714558 -0.18973183 -1.06015343
  -0.61421636 -0.94861028]
 [ 0.39072767 -0.49054341  0.12804365  0.55361931  2.13020339  0.64646721
  -0.90973787 -0.43466673]
 [-1.14304979  0.43797901 -0.09322566  1.39361417  1.47853619  1.35537117
  -0.30699103 -0.77729576]
 [ 0.08397217  0.31417602 -0.09322566  0.03669939  0.74866893  0.14760887
  -0.90681191 -0.43466673]
 [-0.8362943  -0.5524449  -2.19528409  1.13515422  0.02749057  1.48664968
  -0.83951493 -0.00638043]]

First 5 labels:
[0 0 1 0 0]


#### As a MaskedDataset Instance
To work with the data as a `MaskedDataset` instance, which offers additional functionality, set `return_instance` to `True`:


In [7]:
training_dataset = manager.get_dataset_by_type(dataset_type="training", return_instance=True)


### Step 6: Getting a Summary
You can print a summary of the `DatasetsManager` to see the status of the datasets:


In [8]:
manager.summarize()


training_set: {'num_samples': 537, 'num_observations': 8, 'has_pseudo_labels': False, 'has_pseudo_probabilities': False, 'has_confidence_scores': False}
validation_set: {'num_samples': 115, 'num_observations': 8, 'has_pseudo_labels': False, 'has_pseudo_probabilities': False, 'has_confidence_scores': False}
reference_set: Not set
testing_set: Not set
column_labels: ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']


### Step 7: Saving and Resetting Datasets
You can save a specific dataset to a CSV file or reset all datasets managed by the `DatasetsManager`.

#### Save to CSV


In [9]:
manager.save_dataset_to_csv(dataset_type="training", file_path='./data/saved_train_data.csv')


#### Reset Datasets


In [10]:
manager.reset_datasets()
manager.summarize()  # Verify that all datasets are reset


training_set: Not set
validation_set: Not set
reference_set: Not set
testing_set: Not set
column_labels: Not set


## Using the MaskedDataset Class
The `MaskedDataset` class in the `MED3pa.datasets` submodule wraps a dataset's observations and labels and supports operations such as cloning, sampling, refining with a mask, and attaching pseudo labels and confidence scores. This tutorial covers its common usage scenarios.


### Step 1: Importing Necessary Modules
Begin by importing the `MaskedDataset` and `DatasetsManager`, along with NumPy for additional data operations:


In [11]:
from MED3pa.datasets import MaskedDataset, DatasetsManager
import numpy as np


### Step 2: Loading Data with DatasetsManager
Retrieve the dataset as a `MaskedDataset` instance:


In [12]:
manager = DatasetsManager()
manager.set_from_file(dataset_type="training", file='./data/train_data.csv', target_column_name='Outcome')
training_dataset = manager.get_dataset_by_type(dataset_type="training", return_instance=True)


### Step 3: Performing Operations on MaskedDataset
Once you have your dataset loaded as a `MaskedDataset` instance, you can perform various operations:

#### Cloning the Dataset
Create a copy of the dataset to ensure the original data remains unchanged during experimentation:


In [13]:
cloned_instance = training_dataset.clone()
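
To see why a copy matters, the NumPy-only sketch below contrasts a second reference to the same array with an independent copy. It illustrates the general idea only and makes no claim about how `clone` is implemented.

```python
import numpy as np

original = np.array([1.0, 2.0, 3.0])

alias = original          # second name for the same underlying data
clone = original.copy()   # independent copy with its own data

# Modifying the alias also changes the original; the clone is unaffected.
alias[0] = 99.0

print(original[0], clone[0])
```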


#### Sampling the Dataset
You can sample the data either uniformly or randomly. Uniform sampling prioritizes the least-sampled elements, which is useful when the same dataset is sampled multiple times, while random sampling draws elements at random:


In [14]:
sampled_instance = training_dataset.sample_uniform(N=20, seed=42)
sampled_instance_rand = training_dataset.sample_random(N=20, seed=42)
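
The idea of prioritizing the least-sampled elements can be sketched with a per-element counter. This is an assumption-laden illustration, not the `sample_uniform` implementation: `sample_least_used` and the `counts` array are hypothetical names introduced here.

```python
import numpy as np


def sample_least_used(counts, n, rng):
    """Return n indices, preferring elements that were sampled least often."""
    # Random tie-breaker so equally used elements are picked in random order.
    tie_breaker = rng.random(len(counts))
    # np.lexsort uses the last key as the primary key: counts, then tie_breaker.
    order = np.lexsort((tie_breaker, counts))
    chosen = order[:n]
    # Record that these elements have been sampled once more.
    counts[chosen] += 1
    return chosen


rng = np.random.default_rng(42)
counts = np.zeros(10, dtype=int)

first = sample_least_used(counts, 5, rng)
second = sample_least_used(counts, 5, rng)
```

Because the second call prefers elements with a count of zero, the two draws do not overlap, and after both calls every element has been sampled exactly once.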

#### Refining the Dataset
Refine the dataset with a boolean mask, which is useful for filtering out unwanted data points. The dataset is modified in place, and `refine` returns the number of samples that remain:


In [15]:
mask = np.random.rand(len(training_dataset)) > 0.5
remaining_samples = training_dataset.refine(mask=mask)

print("Remaining samples", remaining_samples)

Remaining samples 259
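
Boolean-mask filtering can be illustrated directly on NumPy arrays. The sketch below assumes that `True` means "keep this sample", which the tutorial does not state explicitly, and uses small made-up arrays.

```python
import numpy as np

observations = np.arange(12).reshape(6, 2)
labels = np.array([0, 1, 0, 1, 1, 0])

# One boolean per sample; assumed convention: True keeps the row.
mask = np.array([True, False, True, True, False, False])

kept_observations = observations[mask]
kept_labels = labels[mask]
remaining = int(mask.sum())

print(remaining, kept_observations.shape)
```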


#### Setting Pseudo Labels and Probabilities
Set the pseudo probabilities and pseudo labels for the dataset. You only need to pass the pseudo probabilities and a threshold; the pseudo labels are derived from them:


In [16]:
pseudo_probs = np.random.rand(len(training_dataset))
training_dataset.set_pseudo_probs_labels(pseudo_probabilities=pseudo_probs, threshold=0.5)
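
Deriving labels from probabilities is a simple threshold comparison. The sketch below assumes a "greater than or equal" comparison; whether the library treats a value exactly at the threshold as `True` is an assumption made here.

```python
import numpy as np

pseudo_probs = np.array([0.12, 0.50, 0.93, 0.49])
threshold = 0.5

# Assumed rule: a probability at or above the threshold yields a True label.
pseudo_labels = pseudo_probs >= threshold

print(pseudo_labels)
```

The result is a boolean array, which matches the boolean pseudo labels printed later in this tutorial.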


#### Getting Feature Vectors and Labels
Retrieve the feature vectors, true labels, and pseudo labels:


In [17]:
observations = training_dataset.get_observations()
true_labels = training_dataset.get_true_labels()
pseudo_labels = training_dataset.get_pseudo_labels()

# Print the shape of features, true labels, and pseudo labels to verify they have been loaded
print(f"Features shape: {observations.shape}")
print(f"True Labels shape: {true_labels.shape}")
print(f"Pseudo Labels shape: {pseudo_labels.shape}")

# Optionally, print the first few rows of features and labels
print("\nFirst 5 rows of features:")
print(observations[:5])

print("\nFirst 5 true labels:")
print(true_labels[:5])

print("\nFirst 5 pseudo labels:")
print(pseudo_labels[:5])


Features shape: (259, 8)
True Labels shape: (259,)
Pseudo Labels shape: (259,)

First 5 rows of features:
[[ 0.39072767 -0.49054341  0.12804365  0.55361931  2.13020339  0.64646721
  -0.90973787 -0.43466673]
 [-1.14304979  0.43797901 -0.09322566  1.39361417  1.47853619  1.35537117
  -0.30699103 -0.77729576]
 [ 0.08397217  0.31417602 -0.09322566  0.03669939  0.74866893  0.14760887
  -0.90681191 -0.43466673]
 [-0.22278332  0.22132378  0.45994761 -1.3202154  -0.6936878  -1.42773326
  -0.59080872  1.87807928]
 [-1.14304979  0.12847154 -0.09322566 -1.3202154  -0.6936878  -0.95513062
  -0.77221796 -1.03426754]]

First 5 true labels:
[0 1 0 0 0]

First 5 pseudo labels:
[False  True  True False  True]


#### Getting Confidence Scores
Set the confidence scores, then retrieve them:


In [18]:
confidence_scores = np.random.rand(len(training_dataset))
training_dataset.set_confidence_scores(confidence_scores=confidence_scores)

confidence_scores = training_dataset.get_confidence_scores()
print(f"Confidence Scores shape: {confidence_scores.shape}")

# Optionally, print the first few confidence scores
print("\nFirst 5 confidence scores:")
print(confidence_scores[:5])


Confidence Scores shape: (259,)

First 5 confidence scores:
[0.23590603 0.16552051 0.18632088 0.83749079 0.33214641]


#### Saving the Dataset
You can save the dataset as a .csv file by calling `save_to_csv` with the destination path. This saves the observations, true_labels, pseudo_labels, and pseudo_probabilities, along with the confidence_scores if they were set.

In [19]:
training_dataset.save_to_csv("./data/saved_from_masked.csv")

#### Getting Dataset Information
To get detailed information about the dataset, call `summarize`:


In [20]:
training_dataset.summarize()


Number of samples: 259
Number of observations: 8
Has pseudo labels: True
Has pseudo probabilities: True
Has confidence scores: True
