# Dataset Collection: Training Set Creation (Length 200, RQ1/2)

---

This notebook constructs the training dataset that will later be used to train and evaluate the predictive model.  

The **input file** is:  
- `dataset_sequences_200.pkl` (containing the raw sequences)

The **output file** generated by this notebook is:  
- `dataset_training_200.csv`

---

## Dataset for RQ1 – Feasibility of Model

- The **most comparison-efficient sorting algorithm** for each sequence is determined, serving as the target label.  
- As input features, all five **presortedness metrics** under consideration are computed over the entire sequence.
- The **number of comparisons required** to compute the metrics is not taken into account in this setting, as the primary objective is to demonstrate that the most comparison-efficient sorting algorithm can be predicted from presortedness metrics alone.  
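
The labeling and feature steps above can be sketched as follows. This is a minimal illustration, not the notebook's implementation: the helper names (`count_runs`, `count_inversions`, `pick_label`) are hypothetical, the comparison counts are made-up numbers, and the two metrics use textbook definitions that may differ in detail from the functions in `intellisorts_presortedness_metrics`.

```python
def count_runs(seq):
    """Number of maximal non-decreasing runs (assumed textbook definition)."""
    if not seq:
        return 0
    # Every descent (a > b between neighbours) starts a new run.
    return 1 + sum(1 for a, b in zip(seq, seq[1:]) if a > b)


def count_inversions(seq):
    """Number of pairs (i, j) with i < j and seq[i] > seq[j].

    Quadratic on purpose, to keep the definition easy to read.
    """
    n = len(seq)
    return sum(1 for i in range(n) for j in range(i + 1, n) if seq[i] > seq[j])


def pick_label(comparison_counts):
    """Return the algorithm name with the fewest comparisons.

    On ties, min() returns the first minimal key in dict insertion order.
    """
    return min(comparison_counts, key=comparison_counts.get)


# Hypothetical values, for illustration only.
example_seq = [1, 2, 5, 3, 4, 9, 8]
example_counts = {
    "introsort": 21,
    "insertion_sort": 9,
    "merge_sort": 14,
    "timsort": 9,
    "quick_sort": 17,
}

features = {
    "Runs": count_runs(example_seq),
    "Inversions": count_inversions(example_seq),
}
label = pick_label(example_counts)
```

Because `min` resolves ties by insertion order, the order in which algorithms are listed in the dictionary decides the label whenever two algorithms need the same number of comparisons.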

---

## Dataset for RQ2 – Efficiency of Presortedness Metric Combinations

- To investigate which **combination of presortedness metrics** provides the most efficient prediction, **sampled metrics** are employed rather than full-sequence metrics.  
- Specifically, a **sample size of 10** is used: starting at the first position, every 20th value of the length-200 sequence is taken, and the resulting sample is denoted **`arr_dist`**.  
- For the evaluation of presortedness metric combinations, the **number of comparisons required** to compute the metrics is recorded, as this reflects the computational overhead of feature extraction.  
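
The sampling scheme and the idea of recording feature-extraction cost can be sketched as follows. This is an illustration under assumptions: `sample_evenly` and `runs_with_comparisons` are hypothetical helper names, the `[:sample_size]` slice is a safeguard that is not part of the notebook's code, and the comparison count shown is that of a simple linear scan, which need not match how `runs_comp` counts.

```python
def sample_evenly(seq, sample_size=10):
    """Take every (len(seq) // sample_size)-th element, starting at index 0.

    Assumes len(seq) >= sample_size; otherwise the step would be 0.
    """
    step = len(seq) // sample_size
    # The slice guards against an extra element when len(seq) is not
    # an exact multiple of sample_size (assumption, not in the notebook).
    return [seq[i] for i in range(0, len(seq), step)][:sample_size]


def runs_with_comparisons(seq):
    """Return (number of non-decreasing runs, comparisons used by a linear scan)."""
    runs = 1 if seq else 0
    comparisons = 0
    for a, b in zip(seq, seq[1:]):
        comparisons += 1  # one comparison per adjacent pair
        if a > b:
            runs += 1
    return runs, comparisons


sequence = list(range(200))        # already sorted, length 200
sample = sample_evenly(sequence)   # indices 0, 20, ..., 180
runs, comparisons = runs_with_comparisons(sample)
```

For a sample of 10 values, this scan needs 9 comparisons regardless of the data, which is the kind of overhead the `*_comp` columns are meant to capture for each metric.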

---


## Requirements

In [1]:
%pip install pandas==2.2.3

Defaulting to user installation because normal site-packages is not writeable


## Training Set Creation

In [2]:
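import pickle
import sys

import pandas as pd

# Project-local modules: sorting routines (used here for their comparison counts)
# and the presortedness metric functions
from intellisorts_sorting_algorithms import *
from intellisorts_presortedness_metrics import *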

results = []

with open('dataset_sequences_200.pkl', 'rb') as f:
    dataset_dfs = pickle.load(f)

total = len(dataset_dfs)
for i, (key, df) in enumerate(dataset_dfs.items(), start=1):
    sys.stdout.write(f"\rProcessing dataset {i}/{total}: {key}")
    sys.stdout.flush()
    
    for column in df.columns:
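        arr = df[column].values
        # Skip columns with fewer than 200 values
        if len(arr) < 200:
            continue

        # Count the comparisons each sorting algorithm needs; every call gets
        # its own copy in case the algorithm sorts in place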
        comp_merge = merge_sort(arr.copy())
        comp_quick = quicksort(arr.copy())
        comp_intro = introsort(arr.copy())
        comp_insertion = insertion_sort(arr.copy())
        comp_tim = timsort(arr.copy())

        comparison_counts = {
            'introsort': comp_intro,
            'insertion_sort': comp_insertion,
            'merge_sort': comp_merge,
            'timsort': comp_tim,
            'quick_sort': comp_quick,
        }

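        # Target label: the algorithm with the fewest comparisons
        # (on ties, min() returns the first matching key in the dict)
        min_algorithm = min(comparison_counts, key=comparison_counts.get)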
        min_comparisons = comparison_counts[min_algorithm]

        # RQ1 full array presortedness
        runs_val = runs(arr)
        inversions_val, dis_val = inv_dis(arr)
        deletions_val = deletions(arr)

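        # RQ2: evenly spaced sample (every 20th value for a length-200 sequence),
        # used to assess presortedness metric combinations
        sample_size = 10
        step = len(arr) // sample_size
        # Slice so the sample never exceeds sample_size for longer sequences
        arr_dist = [arr[idx] for idx in range(0, len(arr), step)][:sample_size]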
        runs_val_dist = runs(arr_dist)
        inversions_val_dist, dis_val_dist = inv_dis(arr_dist)
        deletions_val_dist = deletions(arr_dist)
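        # Comparisons needed to compute each metric on the sample
        # (the feature-extraction overhead evaluated in RQ2)
        inv_dis_comp_dist = inv_dis_comp(arr_dist)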
        deletions_comp_dist = deletions_comp(arr_dist)
        runs_comp_dist = runs_comp(arr_dist)

        results.append({
            'Dataset': key,
            'Column': column,
            'Algorithm': min_algorithm,
            'Comparisons': min_comparisons,
            'Runs': runs_val,
            'Inversions': inversions_val,
            'Deletions': deletions_val,
            'Dis': dis_val,

            # RQ2
            'Inversions_dist': inversions_val_dist,
            'Dis_dist': dis_val_dist,
            'Inversions_Dis_dist_comp': inv_dis_comp_dist,
            
            'Deletions_dist': deletions_val_dist,
            'Deletions_dist_comp': deletions_comp_dist,
            
            'Runs_dist': runs_val_dist,
            'Runs_dist_comp': runs_comp_dist,
    
            'introsort': comp_intro,
            'insertion_sort': comp_insertion,
            'merge_sort': comp_merge,
            'timsort': comp_tim,
            'quick_sort': comp_quick,
        })

df_results = pd.DataFrame(results)
print(df_results)
df_results.to_csv('dataset_training_200.csv')

               Dataset                                             Column  \
0   census_tracts_2010                                              GEOID   
1   census_tracts_2010                                              POP10   
2   census_tracts_2010                                               HU10   
3   census_tracts_2010                                              ALAND   
4   census_tracts_2010                                             AWATER   
5   census_tracts_2010                                         ALAND_SQMI   
6   census_tracts_2010                                        AWATER_SQMI   
7   census_tracts_2010                                           INTPTLAT   
8   census_tracts_2010  INTPTLONG                                     ...   
9     kenpom-ncaa-2025                                         Unnamed: 0   
10    kenpom-ncaa-2025                                                 Rk   
11    kenpom-ncaa-2025                                             