# V1 Resource Monitor Dataset

## Overview

The code below creates a dataset made up of collections of laboratory equipment registries. Each collection (folder) contains four spreadsheets: an instrument spreadsheet with information on every instrument regardless of type, one spreadsheet for each instrument type (thermocycler and heater-shaker), and a corruption summary showing how each record has been corrupted. Corruptions are listed by the Reference Number of each instrument, which is never corrupted.

Any other value, such as the regular ID, serial number, or min/max temperature, may be corrupted, for example by introducing duplicate values or by swapping minimum and maximum values.
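To make the min/max swap concrete, here is a minimal sketch in plain Python. It is not the implementation in `lab_equipment_registry_generator`; the function name `swap_min_max` and the field names `reference_number`, `min_temp`, and `max_temp` are assumptions chosen for illustration.

```python
import random


def swap_min_max(records, fraction, seed=0):
    """Swap min_temp and max_temp in round(len(records) * fraction) records.

    Hypothetical helper: returns the corrupted copies plus the reference
    numbers of the records that were changed. The reference number itself
    is left untouched so it can be used to look up each corruption.
    """
    rng = random.Random(seed)
    n_corrupt = round(len(records) * fraction)
    chosen = set(rng.sample(range(len(records)), n_corrupt))

    corrupted, changed = [], []
    for idx, rec in enumerate(records):
        rec = dict(rec)  # copy so the input records stay unmodified
        if idx in chosen:
            rec["min_temp"], rec["max_temp"] = rec["max_temp"], rec["min_temp"]
            changed.append(rec["reference_number"])
        corrupted.append(rec)
    return corrupted, changed


# Illustrative records; the real spreadsheets may use different columns.
records = [
    {"reference_number": f"REF-{i:03d}", "min_temp": 4, "max_temp": 99}
    for i in range(10)
]
corrupted, changed = swap_min_max(records, fraction=0.3)
print(changed)
```

Returning the list of changed reference numbers alongside the data mirrors the role of the corruption summary: the corrupted table alone does not say which rows were altered.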

Each collection has a different corruption percentage, which is encoded in its folder name.

The dataset is structured as follows:
- Dataset Directory
    - corrupt_perc_0
        - instrument_0.xlsx
        - thermocycler_0.xlsx
        - heatershaker_0.xlsx
        - corruption_summary_0.xlsx
    - corrupt_perc_10
        - instrument_10.xlsx
        - thermocycler_10.xlsx
        - heatershaker_10.xlsx
        - corruption_summary_10.xlsx
    - ... (one folder per corruption level)
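The folder names follow from spreading the corruption levels evenly between 0% and 100%. The sketch below shows that mapping on its own; the helper name `corruption_levels` is hypothetical, and treating a single folder as 0% corruption is an assumption rather than something the notebook specifies.

```python
def corruption_levels(num_folders):
    """Return (fraction, folder_name) pairs spaced evenly from 0% to 100%."""
    if num_folders < 1:
        raise ValueError("num_folders must be at least 1")
    if num_folders == 1:
        # Assumption: a lone folder is treated as the uncorrupted baseline.
        return [(0.0, "corrupt_perc_0")]

    levels = []
    for i in range(num_folders):
        fraction = i / (num_folders - 1)
        # round() avoids truncating float error when converting to a percent.
        levels.append((fraction, f"corrupt_perc_{round(fraction * 100)}"))
    return levels


for fraction, name in corruption_levels(11):
    print(f"{fraction:.1f} -> {name}")
```

With 11 folders the levels step by 10 percentage points, which gives the `corrupt_perc_0`, `corrupt_perc_10`, ... layout listed above.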

## Imports

General Imports

In [9]:
import os
import pandas as pd
from datetime import datetime

Sequence Generator Imports

In [10]:
from lab_equipment_registry_generator import Equipment
from lab_equipment_registry_generator import EquipmentCorrupt as Corrupt

## Dataset Generation

In [11]:
def create_dataset(parent_dir, num_folders):
    """
    Creates a dataset with the specified number of folders, each containing Excel files with corrupted data.

    Parameters
    ----------
    parent_dir : str
        The parent directory where the dataset folders will be created.
    num_folders : int
        The number of folders to create, with corruption levels incrementing evenly.
    """
    # Ensure the parent directory exists (no error if it is already there).
    os.makedirs(parent_dir, exist_ok=True)

    # Create num_folders subfolders with evenly spaced corruption levels from 0% to 100%.
    for i in range(num_folders):
        # A single folder gets 0% corruption; this also avoids dividing by zero.
        corruption_pct = i / (num_folders - 1) if num_folders > 1 else 0.0
        # Use round() rather than int(): int() truncates float error (e.g. 0.29 * 100 -> 28).
        corruption_pct_str = f"{round(corruption_pct * 100)}"
        folder_name = f"corrupt_perc_{corruption_pct_str}"
        folder_path = os.path.join(parent_dir, folder_name)

        os.makedirs(folder_path, exist_ok=True)
        
        # Generate the instrument table with 200 rows.
        instrument_df = Equipment.generate_instruments(200, include_thermocyclers=True, include_heater_shakers=True)
        thermocycler_df = Equipment.thermocycler(instrument_df)
        heatershaker_df = Equipment.heatershaker(instrument_df)

        # Corrupt the data.
        instrument_df, thermocycler_df, heatershaker_df, corruption_summary = Corrupt.run_all_corruptions(
            instrument_df, thermocycler_df, heatershaker_df, corruption_pct
        )

        # Save the dataframes to Excel files.
        instrument_df.to_excel(os.path.join(folder_path, f'instrument_{corruption_pct_str}.xlsx'), index=False)
        thermocycler_df.to_excel(os.path.join(folder_path, f'thermocycler_{corruption_pct_str}.xlsx'), index=False)
        heatershaker_df.to_excel(os.path.join(folder_path, f'heatershaker_{corruption_pct_str}.xlsx'), index=False)
        corruption_summary.to_excel(os.path.join(folder_path, f'corruption_summary_{corruption_pct_str}.xlsx'), index=False)

In [12]:
# Run function to make dataset.
parent_dir = "Resource_Monitor_Dataset_V1"
num_folders = 11 
create_dataset(parent_dir, num_folders)
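Once the dataset exists, the collections usually need to be visited in order of corruption level. A plain string sort would place `corrupt_perc_100` before `corrupt_perc_20`, so the sketch below parses the percentage from each folder name and sorts numerically. The helper `list_collections` is a hypothetical name, not part of the generator package.

```python
import re
from pathlib import Path

# Matches the folder names produced by create_dataset, e.g. "corrupt_perc_30".
FOLDER_PATTERN = re.compile(r"^corrupt_perc_(\d+)$")


def list_collections(parent_dir):
    """Return (percent, path) pairs for each collection folder, sorted by percent."""
    found = []
    for entry in Path(parent_dir).iterdir():
        match = FOLDER_PATTERN.match(entry.name)
        if entry.is_dir() and match:
            found.append((int(match.group(1)), entry))
    return sorted(found, key=lambda pair: pair[0])


# Only runs if the dataset directory has already been generated.
if Path("Resource_Monitor_Dataset_V1").is_dir():
    for percent, path in list_collections("Resource_Monitor_Dataset_V1"):
        print(percent, sorted(p.name for p in path.glob("*.xlsx")))
```

Each spreadsheet can then be loaded with `pd.read_excel` on the listed paths.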