# Training Set Creation (Final Model)

---

This notebook constructs the training dataset that will later be used to build and evaluate the final model.  

The **input file** is:  
- `dataset_sequences_10k.pkl` (containing the raw sequences) 

The **output file** generated by this notebook is:  
- `dataset_training_400p.csv`

---

## Metrics Required

To train the model, each sequence is labeled with its **most comparison-efficient sorting algorithm**, i.e. whichever of Insertion Sort, Merge Sort, and Timsort needs the fewest comparisons to sort it. This label is the target. The input features are the **sampled presortedness metrics** (**Runs** and **Deletions**) together with the **sequence length**.  
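The labeling step can be sketched as follows. This is a minimal illustration, not the notebook's implementation: `insertion_sort_comparisons` and `pick_label` are hypothetical helpers, and the counts for the other two algorithms are made-up numbers. The real counts come from the `intellisorts_sorting_algorithms` module.

```python
def insertion_sort_comparisons(values):
    """Sort a copy with insertion sort and return the number of key comparisons."""
    a = list(values)
    comparisons = 0
    for i in range(1, len(a)):
        key = a[i]
        j = i - 1
        while j >= 0:
            comparisons += 1          # one comparison of a[j] against key
            if a[j] <= key:
                break
            a[j + 1] = a[j]           # shift the larger element to the right
            j -= 1
        a[j + 1] = key
    return comparisons


def pick_label(comparison_counts):
    """Return (algorithm, comparisons) for the algorithm with the fewest comparisons."""
    best = min(comparison_counts, key=comparison_counts.get)
    return best, comparison_counts[best]


# Hypothetical counts for one already-sorted sequence.
counts = {
    'insertion_sort': insertion_sort_comparisons([1, 2, 3, 4]),  # 3 comparisons
    'merge_sort': 5,   # made-up value
    'timsort': 3,      # made-up value
}
label, comparisons = pick_label(counts)
print(label, comparisons)  # prints: insertion_sort 3
```

Note that `min` returns the first minimal key it encounters, so ties are resolved by the dictionary's insertion order. In the example, `insertion_sort` wins the tie against `timsort`.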

For the evaluation of the model, we additionally record the **number of comparisons required** to compute the presortedness metrics, as this reflects the computational overhead associated with feature extraction.  
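To make the notion of feature-extraction overhead concrete, here is a hedged sketch of the two metrics with explicit comparison counters. The definitions are assumptions based on common usage in the presortedness literature (Runs as the number of adjacent descents, Deletions as the sample length minus the longest non-decreasing subsequence). The function names are hypothetical, and the notebook's own `runs`, `deletions`, `runs_comp`, and `deletions_comp` may define or count things differently.

```python
def runs_with_count(sample):
    """Return (number of adjacent descents, comparisons used)."""
    comparisons = 0
    descents = 0
    for prev, cur in zip(sample, sample[1:]):
        comparisons += 1
        if prev > cur:
            descents += 1
    return descents, comparisons


def deletions_with_count(sample):
    """Return (len(sample) - longest non-decreasing subsequence, comparisons used)."""
    comparisons = 0
    tails = []  # tails[k]: smallest tail of a non-decreasing subsequence of length k + 1
    for x in sample:
        lo, hi = 0, len(tails)
        while lo < hi:                # binary search for the first tail greater than x
            mid = (lo + hi) // 2
            comparisons += 1
            if tails[mid] > x:
                hi = mid
            else:
                lo = mid + 1
        if lo == len(tails):
            tails.append(x)           # x extends the longest subsequence found so far
        else:
            tails[lo] = x             # x becomes a smaller tail for length lo + 1
    return len(sample) - len(tails), comparisons


sample = [1, 5, 2, 3, 4]
print(runs_with_count(sample))          # prints: (1, 4)
print(deletions_with_count(sample)[0])  # prints: 1
```

Under these assumptions, Runs always costs `len(sample) - 1` comparisons, while the cost of Deletions depends on the data.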

---

## Sampling Strategies

To address **RQ3**, the notebook implements different sampling strategies and sample sizes. Specifically, the following methods can be applied:  

- `sampling_10_step(sample_size)`  
- `sampling_30_step(sample_size)`  

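These procedures allow us to investigate how the sampling strategy and the sample size influence the model. The final training set built in this notebook does not use them. It uses the fixed scheme `samplingDistStatic()` (RQ4), which takes 40 samples at step 10: 20 from the start, 10 from the middle, and 10 from the end of each sequence.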
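The two RQ3 strategies can be illustrated on a toy sequence. This sketch mirrors the slicing logic of the notebook's functions, with one assumed difference: the sequence is passed in as an explicit `arr` parameter instead of being taken from the enclosing loop.

```python
import numpy as np


def sampling_10_step(arr, sample_size):
    """Take sample_size elements from the start, every 10th element."""
    return [arr[i] for i in range(0, sample_size * 10, 10)]


def sampling_30_step(arr, sample_size):
    """Take every 30th element: all but 10 samples from the start, 10 from the end."""
    step = 30
    head = arr[: (sample_size - 10) * step : step]
    tail = arr[-10 * step :: step]
    return np.concatenate((head, tail))


arr = np.arange(1000)  # toy sequence in which each value equals its index

s10 = sampling_10_step(arr, 5)
print([int(v) for v in s10])  # prints: [0, 10, 20, 30, 40]

s30 = sampling_30_step(arr, 12)
print(len(s30))               # prints: 12
print(s30[:2], s30[-1])       # prints: [ 0 30] 970
```

Because the toy values equal their indices, the output shows directly which positions each strategy reads: `sampling_10_step` only looks at the beginning of the sequence, whereas `sampling_30_step` also covers the last 300 elements.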

---


## Requirements

In [1]:
%pip install pandas==2.2.3
%pip install numpy==1.26.0

Defaulting to user installation because normal site-packages is not writeable


## Training Set Creation

In [2]:
import pickle
import os
import pandas as pd
import numpy as np
import sys

from intellisorts_sorting_algorithms import *
from intellisorts_presortedness_metrics import *

results = []

with open('dataset_sequences_10k.pkl', 'rb') as f:
    dataset_dfs = pickle.load(f)

total = len(dataset_dfs)
for i, (key, df) in enumerate(dataset_dfs.items(), start=1):
    sys.stdout.write(f"\rProcessing dataset {i}/{total}: {key}")
    sys.stdout.flush()
    
    for column in df.columns:
        arr = df[column].values
        # Skip sequences with fewer than 400 elements
        if len(arr) < 400:
            continue

        # Count the comparisons each sorting algorithm needs (each one sorts its own copy)
        comp_merge = merge_sort(arr.copy())
        comp_insertion = insertion_sort(arr.copy())
        comp_tim = timsort(arr.copy())
                
        comparison_counts = {
            'insertion_sort': comp_insertion,
            'merge_sort': comp_merge,
            'timsort': comp_tim,
        }

        min_algorithm = min(comparison_counts, key=comparison_counts.get)
        min_comparisons = comparison_counts[min_algorithm]

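        # RQ3: sampling_10_step and sampling_30_step were used to evaluate the
        # sampling strategy and sample size; the final training set does not use them.

        def sampling_10_step(sample_size):
            # Take every 10th element from the start of the sequence.
            return [arr[i] for i in range(0, sample_size * 10, 10)]

        def sampling_30_step(sample_size):
            step = 30

            # All but the last 10 samples: every 30th element from the start.
            sample = arr[: (sample_size - 10) * step : step]

            # Last 10 samples: every 30th element of the final 300 elements.
            reverse_sampled = arr[-10 * step::step]

            return np.concatenate((sample, reverse_sampled))

        # RQ4: sampling for the final model (40 samples, step 10)

        def samplingDistStatic():
            # 20 samples from the first 200 elements
            first_20 = arr[:200:10]

            # 10 samples from the 100 elements starting at the midpoint
            middle_start = len(arr) // 2
            middle_10 = arr[middle_start:middle_start + 100:10]

            # 10 samples from the last 100 elements
            last_10 = arr[-100::10]

            return np.concatenate([first_20, middle_10, last_10])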

        results.append({
            'Dataset': key,
            'Column': column,
            'Algorithm': min_algorithm,
            'Comparisons': min_comparisons,
            
            'deletions_val_distStatic': deletions(samplingDistStatic()),
            'runs_val_distStatic': runs(samplingDistStatic()),
            'deletions_comp_distStatic': deletions_comp(samplingDistStatic()),
            'runs_comp_distStatic': runs_comp(samplingDistStatic()),
            
            'arr_len': len(arr),
            
            'insertion_sort': comp_insertion,
            'merge_sort': comp_merge,
            'timsort': comp_tim,
        })


df_results = pd.DataFrame(results)
print(df_results)
df_results.to_csv('dataset_training_400p.csv')

               Dataset                                             Column  \
0   census_tracts_2010                                              GEOID   
1   census_tracts_2010                                              POP10   
2   census_tracts_2010                                               HU10   
3   census_tracts_2010                                              ALAND   
4   census_tracts_2010                                             AWATER   
5   census_tracts_2010                                         ALAND_SQMI   
6   census_tracts_2010                                        AWATER_SQMI   
7   census_tracts_2010                                           INTPTLAT   
8   census_tracts_2010  INTPTLONG                                     ...   
9         commute_data                                              OFIPS   
10        commute_data                                              DFIPS   
11        commute_data                                            O