# RQ3: Sampling Strategy

These experiments investigate how different sampling strategies and sample sizes influence the model.

- Sampling strategy (sequence length: 200 - 1000): every 10th element of the sequence is sampled until the sample size is reached.
- Sampling strategy (sequence length: 1200 - 2000): every 30th element from the start and end of the sequence is sampled until the sample size is reached.
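
The two strategies above can be sketched as follows. This is only an illustration under stated assumptions: the function names `sample_every_kth` and `sample_every_kth_both_ends` are hypothetical and are not the `sampling_strategy_10_step` / `sampling_strategy_30_step` functions used later in this notebook, and the alternating front/back order in the second function is one possible reading of "from the start and end".

```python
def sample_every_kth(seq, sample_size, step=10):
    """Take every `step`-th element from the front until `sample_size` is reached."""
    return list(seq[::step][:sample_size])


def sample_every_kth_both_ends(seq, sample_size, step=30):
    """Alternate between the front and the back of the sequence, moving `step`
    positions inward each round, until `sample_size` is reached.
    The sampled elements are returned in their original order."""
    n = len(seq)
    indices = []
    offset = 0
    # stop when the sample is full or the front and back positions would meet
    while len(indices) < sample_size and offset < n - 1 - offset:
        indices.append(offset)                # element taken from the start
        if len(indices) < sample_size:
            indices.append(n - 1 - offset)    # element taken from the end
        offset += step
    return [seq[i] for i in sorted(indices)]


short_sequence = list(range(200))
long_sequence = list(range(1200))
print(sample_every_kth(short_sequence, 5))
print(sample_every_kth_both_ends(long_sequence, 6))
```

Note that a sequence of length 200 offers at most 20 elements at a step of 10, so in this sketch larger sample sizes simply yield fewer elements than requested.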

The **input file** is:  
- the `dataset_sequences/dataset_sequences_10k.pkl` containing the raw sequences.

The **output files** computed by this notebook are:  
- the `rq3_dataset/dataset_training_rq3_<length>_<sample size>.csv` containing presortedness features and labels for a fixed sequence length and sample size.
- the `rq3_model/model_rq3...` files containing the best model for each fixed sequence length and sample size.


## Requirements

The following cell installs the exact package versions required by this notebook into the current Jupyter kernel.

In [None]:
%pip install scikit-learn==1.7.0
%pip install pandas==2.2.3
%pip install numpy==1.26.0
%pip install tensorflow==2.15.0
%pip install matplotlib==3.10.1

## Create Datasets for Training, Validation, and Test

For each fixed sequence length and sample size, a dedicated dataset is created that contains the presortedness features and labels.

To train the model, we determine the **most comparison-efficient sorting algorithm** (*Insertionsort*, *Mergesort*, *Timsort*) for each sequence; this algorithm serves as the target label. As input features, we use the **sampled presortedness metrics** (*Runs* and *Deletions*).  
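
As an illustration of how such a label could be derived, the sketch below counts `<` comparisons with a small wrapper class and picks the algorithm with the fewest. It is not the implementation in `intellisorts_training_set`; the names `Counted`, `count_comparisons`, and `label_sequence` are hypothetical, and CPython's built-in `sorted()` (a Timsort variant) stands in for Timsort.

```python
class Counted:
    """Wraps a value and counts every `<` comparison in a shared counter."""
    __slots__ = ("value", "counter")

    def __init__(self, value, counter):
        self.value = value
        self.counter = counter

    def __lt__(self, other):
        self.counter[0] += 1
        return self.value < other.value


def insertion_sort(items):
    items = list(items)
    for i in range(1, len(items)):
        current = items[i]
        j = i - 1
        # shift larger elements one position to the right
        while j >= 0 and current < items[j]:
            items[j + 1] = items[j]
            j -= 1
        items[j + 1] = current
    return items


def merge_sort(items):
    items = list(items)
    if len(items) <= 1:
        return items
    mid = len(items) // 2
    left, right = merge_sort(items[:mid]), merge_sort(items[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if right[j] < left[i]:
            merged.append(right[j])
            j += 1
        else:  # ties take the left element, which keeps the sort stable
            merged.append(left[i])
            i += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def count_comparisons(sort_fn, seq):
    counter = [0]
    sort_fn([Counted(value, counter) for value in seq])
    return counter[0]


def label_sequence(seq):
    """Return the algorithm with the fewest comparisons and all counts."""
    counts = {
        "Insertionsort": count_comparisons(insertion_sort, seq),
        "Mergesort": count_comparisons(merge_sort, seq),
        "Timsort": count_comparisons(sorted, seq),  # assumption: sorted() as Timsort stand-in
    }
    return min(counts, key=counts.get), counts


print(label_sequence([3, 1, 2, 5, 4, 8, 7, 6]))
```

In this sketch, ties are resolved in favor of the first algorithm in the dictionary; the actual tie-breaking rule used for the datasets is not specified here.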

For the evaluation of the model, we additionally record the **number of comparisons required** to compute the presortedness metrics, as this reflects the computational overhead associated with feature extraction.  
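
The following sketch shows one way to compute both metrics on a sample while counting the comparisons they cost. It assumes the common definitions (*Runs* as the number of adjacent descending pairs, *Deletions* as the minimum number of elements to remove so that the rest is non-decreasing); the function names are hypothetical and the notebook's own implementation may differ in definition and in comparison count.

```python
def runs_with_comparisons(sample):
    """Count adjacent pairs in descending order (step-downs)."""
    comparisons = 0
    runs = 0
    for left, right in zip(sample, sample[1:]):
        comparisons += 1
        if right < left:
            runs += 1
    return runs, comparisons


def deletions_with_comparisons(sample):
    """Sample length minus the length of a longest non-decreasing subsequence."""
    comparisons = 0
    # tails[k] is the smallest possible last element of a
    # non-decreasing subsequence of length k + 1
    tails = []
    for value in sample:
        lo, hi = 0, len(tails)
        while lo < hi:  # binary search for the first tail greater than value
            mid = (lo + hi) // 2
            comparisons += 1
            if value < tails[mid]:
                hi = mid
            else:
                lo = mid + 1
        if lo == len(tails):
            tails.append(value)   # value extends the longest subsequence
        else:
            tails[lo] = value     # value gives a smaller tail for length lo + 1
    return len(sample) - len(tails), comparisons


sample = [1, 3, 2, 4, 5]
print(runs_with_comparisons(sample))
print(deletions_with_comparisons(sample))
```

Because both functions only look at the sample, their cost grows with the sample size rather than with the sequence length, which is the overhead recorded for the evaluation.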

In [None]:
import os
import pickle
import pandas as pd

from intellisorts_training_set import *

dataset_path = 'dataset_sequences/dataset_sequences_10k.pkl'

print(f"Loading sequences {dataset_path} ...")

with open(dataset_path, 'rb') as f:
    dataset_10k_dfs = pickle.load(f)

print("Sequences loaded")

In [None]:
# RQ3: sampling_strategy_10_step and sampling_strategy_30_step are used to evaluate the optimal sampling strategy and sample size
step10 = sampling_strategy_10_step
step30 = sampling_strategy_30_step

length_and_sampling_strategy = [
    (200, step10),
    (400, step10),
    (600, step10),
    (800, step10),
    (1000, step10),
    (1200, step30),
    (1400, step30),
    (2000, step30)
]
sample_sizes = [5, 50]  # smallest and largest sample size (both inclusive)

for sample_size in range(sample_sizes[0], sample_sizes[1] + 1):
    for length, sampling_strategy in length_and_sampling_strategy:
        df_results = compute_training_data(
            dataset_dfs = dataset_10k_dfs,
            min_length = length,
            max_length = length,
            sampling_strategy = sampling_strategy,
            sample_size = sample_size
        )
        print()
        print(f'Dataset D[{length}, {sample_size}]:', len(df_results))
        df_results.to_csv(f'rq3_dataset/dataset_training_rq3_{length}_{sample_size}.csv')

## Train and Evaluate Neural Network Model

For each fixed sequence length and sample size, a model is trained, with hyperparameters optimized in a grid search.
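
To illustrate what a grid search enumerates, the sketch below expands a parameter grid into its individual configurations. It is only a sketch: `expand_grid` is a hypothetical helper, not part of `intellisorts_model_training`, and the actual `grid_search` function may organize training and model selection differently.

```python
from itertools import product

param_grid = {
    'batch_size': [512],
    'epochs': [500],
    'layers': [4, 6, 8],
    'layersize': [4, 6, 8],
}


def expand_grid(grid):
    """Return one dict per combination of the listed parameter values."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


candidates = expand_grid(param_grid)
print(len(candidates))  # 1 * 1 * 3 * 3 = 9 configurations
for params in candidates[:3]:
    print(params)
```

A grid search would train one model per configuration and keep the one with the best validation score.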

In [None]:
import pandas as pd
import numpy as np

from intellisorts_model_training import grid_search

np.random.seed(42)

data_per_length = {}

for length, _ in length_and_sampling_strategy:
    data_per_length[length] = []

for sample_size in range(sample_sizes[0], sample_sizes[1] + 1):
    for length, _ in length_and_sampling_strategy:
        
        # load training dataset
        df_results = pd.read_csv(f'rq3_dataset/dataset_training_rq3_{length}_{sample_size}.csv')
        
        # features
        train_input = df_results[['Deletions', 'Runs','SequenceLength']]
        train_output = df_results['Algorithm']
        
        # perform grid search
        param_grid = {
            'batch_size': [512],
            'epochs': [500],
            'layers': [4,6,8],
            'layersize': [4,6,8]
        }
        
        (
            best_model,
            scaler,
            label_encoder,
            test_accuracy,
            test_indices,
            test_true_algorithms,
            test_predicted_algorithms
        ) = grid_search(
            'rq3_model/model_rq3',
            train_input,
            train_output,
            param_grid
        )

        # evaluate comparisons on the held-out test set
        test_set_df = df_results.iloc[test_indices].reset_index(drop=True)
        
        c = []  # comparisons of the predicted algorithm, per test sequence
        t = []  # comparisons of the true (most efficient) algorithm, per test sequence
        
        for index, row in test_set_df.iterrows():
            c.append(row[test_predicted_algorithms[index] + "_Comparisons"])
            t.append(row[test_true_algorithms[index] + "_Comparisons"])
        
        # average comparisons per sequence, rounded to one decimal place
        minimum_possible = round(np.sum(t)/len(test_set_df), 1)
        algorithm_prediction_model = round(np.sum(c)/len(test_set_df), 1)
        
        def calculate_average(column_name):
            return round(np.sum(test_set_df[column_name]) / len(test_set_df), 1)

        deletions_comp_dist_avg = calculate_average('Deletions_Comparisons')
        runs_comp_dist_avg = calculate_average('Runs_Comparisons')
        presortedness_average_comp = runs_comp_dist_avg + deletions_comp_dist_avg

        combined_comp = algorithm_prediction_model + presortedness_average_comp
        
        print('Minimum possible: ', minimum_possible)
        print('Algorithm predicting model: ', algorithm_prediction_model)
        print('Presortedness calculation comparisons: ', presortedness_average_comp)
        print('Combined comparisons model: ', round(combined_comp, 1))

        data = data_per_length[length]
        data.append(
            {
                'accuracy': test_accuracy,
                'sort_comparisons': algorithm_prediction_model,
                'presortedness_comparisons': presortedness_average_comp,
                'combined': combined_comp
            }
        )

## Plot the Average Comparisons in Relation to the Sample Size

The plots show the average number of comparisons in relation to the sample size for each sequence length. (1) The blue area represents the comparisons required to compute the presortedness metrics. (2) The orange area shows the comparisons required by the predicted sorting algorithm. (3) The blue line represents the total combined comparisons.

In [None]:
import matplotlib.pyplot as plt

for length, _ in length_and_sampling_strategy:
    print()
    print("Sequence Length:", length)
    print()
    
    data = data_per_length[length]
    
    df = pd.DataFrame(data)
    display(df)
    
    # plotting
    plt.figure(figsize=(12, 6))
    
    # x-axis: row i of df belongs to sample size sample_sizes[0] + i
    x = df.index + sample_sizes[0]
    
    # line and filled area for the combined comparisons (sorting + presortedness)
    plt.plot(x, df['combined'], label='Additional Presortedness Comparisons', color='blue')
    plt.fill_between(x, df['combined'], color='blue', alpha=0.3)
    
    # line and filled area for the comparisons of the predicted sorting algorithm
    plt.plot(x, df['combined'] - df['presortedness_comparisons'], label='Predicted algorithm Comparisons', color='orange')
    plt.fill_between(x, df['combined'] - df['presortedness_comparisons'], color='orange')
    
    plt.grid()
    plt.xlabel('Sample size')
    plt.ylabel('Number of Comparisons')
    plt.title(f'Comparisons for different sample sizes (sequence length {length})')
    plt.legend()
    
    # show plot
    plt.show()

## Optimal Sample Size at Different Sequence Sizes

Optimal sample size in relation to the sequence length. (1) For sequences of up to 1000 elements, every 10th element was sampled (blue graph). (2) For sequences longer than 1000 elements, every 30th element was sampled (red graph).
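
The values plotted in the next cell are hard-coded. As a sketch of how such an optimum could be read off the results, the helper below picks the sample size with the lowest combined comparison count. This assumes that "optimal" means minimal `combined` comparisons, which is an assumption; `optimal_sample_size` is a hypothetical helper and the rows are made-up illustration data, not results from this notebook.

```python
def optimal_sample_size(rows, first_sample_size):
    """rows[i] holds the results for sample size first_sample_size + i.
    Returns the sample size with the fewest combined comparisons."""
    best_index = min(range(len(rows)), key=lambda i: rows[i]['combined'])
    return first_sample_size + best_index


# made-up illustration data in the shape of data_per_length[length]
example_rows = [
    {'sort_comparisons': 900.0, 'presortedness_comparisons': 10.0, 'combined': 910.0},
    {'sort_comparisons': 850.0, 'presortedness_comparisons': 14.0, 'combined': 864.0},
    {'sort_comparisons': 848.0, 'presortedness_comparisons': 19.0, 'combined': 867.0},
]

print(optimal_sample_size(example_rows, first_sample_size=5))
```

The example reflects the trade-off behind the plots: a larger sample can improve the prediction, but past some point the extra presortedness comparisons outweigh the savings in sorting.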

In [None]:
array_sizes = [200,400,600,800,1000]
optimal_sample_size = [13,22,38,45,44]

extended_array_sizes = [1200, 1400, 2000]
extended_optimal_sample_size = [27, 34, 43]

# Font settings must be applied before the figure is created,
# otherwise they only take effect on a second run
plt.rc('font', family='Lato', size=24)

# Plotting
plt.figure(figsize=(12, 6))

plt.plot(array_sizes, optimal_sample_size, marker='o', color='blue', label='Every 10th Number Sampling')

plt.plot(extended_array_sizes, extended_optimal_sample_size, marker='o', color='red', label='Every 30th Number Sampling')

plt.xticks(range(200, 2100, 200))

# Labels and title
plt.xlabel('Sequence Size')
plt.ylabel('Optimal Sample Size')
# plt.title('Optimal Sample Size by Sequence Size')
plt.grid(True)

plt.legend()

# Show plot
plt.show()