# **Automatic Detection of Hyperparameters for k-NN Models**

This project explores how dataset characteristics relate to hyperparameter selection, focusing on the best `k` for k-Nearest Neighbors (k-NN). The pipeline has three stages (data generation, feature extraction, and prediction) and aims to automate hyperparameter detection for machine learning models.

<figure>
  <img style="float: left;" src="fig/fig1.mmd.svg"/>
   <figcaption>Pipeline of the process</figcaption>

</figure>

<br>
<br>

## **Steps of the pipeline**

1. **Data Generation**:
   - Generates synthetic datasets with varying characteristics (size, distribution types, noise levels).
   - Saves datasets as CSV files in the `raw_generated_data` folder.

2. **Feature Extraction**:
   - Processes raw datasets to extract relevant features.
   - Calculates key metrics such as noise levels, class counts, and dataset dimensions.
   - Stores processed data as `processed_dataset.csv`.

3. **Prediction**:
   - Builds a regression model to predict the best hyperparameter (`best_k`) based on dataset features.
   - Trains and evaluates the model using scikit-learn's regression utilities.
   - Saves the trained model for future use.
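
The feature-extraction and labelling steps above can be illustrated with a minimal, standard-library-only sketch. Everything in it is an assumption made for illustration: the helper names (`make_toy_dataset`, `extract_features`, `knn_loo_accuracy`, `find_best_k`), the feature set, the candidate values of `k`, and the use of leave-one-out accuracy to define `best_k` are hypothetical and are not taken from the project's code.

```python
import math
import random
from collections import Counter


def make_toy_dataset(n_per_class=30, noise=0.5, seed=0):
    """Two Gaussian blobs in 2-D; a stand-in for one generated dataset."""
    rng = random.Random(seed)
    rows = []
    for label, center in enumerate([(0.0, 0.0), (3.0, 3.0)]):
        for _ in range(n_per_class):
            point = (rng.gauss(center[0], noise), rng.gauss(center[1], noise))
            rows.append((point, label))
    return rows


def extract_features(rows):
    """Dataset-level descriptors (hypothetical feature set)."""
    labels = [label for _, label in rows]
    return {
        "n_samples": len(rows),
        "n_features": len(rows[0][0]),
        "n_classes": len(set(labels)),
    }


def knn_loo_accuracy(rows, k):
    """Leave-one-out accuracy of a majority-vote k-NN classifier."""
    correct = 0
    for i, (point, label) in enumerate(rows):
        others = rows[:i] + rows[i + 1:]
        neighbours = sorted(others, key=lambda r: math.dist(point, r[0]))[:k]
        votes = Counter(lbl for _, lbl in neighbours)
        if votes.most_common(1)[0][0] == label:
            correct += 1
    return correct / len(rows)


def find_best_k(rows, candidates=(1, 3, 5, 7, 9)):
    """Return the smallest candidate k with the highest accuracy, plus all scores."""
    scores = {k: knn_loo_accuracy(rows, k) for k in candidates}
    best = max(candidates, key=lambda k: scores[k])  # max() keeps the first on ties
    return best, scores


rows = make_toy_dataset()
features = extract_features(rows)
best_k, scores = find_best_k(rows)
print(features, best_k)
```

Under these assumptions, one row per dataset (the feature dictionary plus `best_k`) would form the training table for the regression model described in step 3.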


In [2]:
from generation.data_generation import genf_multiple_datasets
genf_multiple_datasets(100)

Clearing raw data folder...
Folder cleared. Generating datasets...
Generating dataset 1/100...
Dataset saved to raw_generated_data/g1.csv
Generating dataset 2/100...
Dataset saved to raw_generated_data/g2.csv
Generating dataset 3/100...
Dataset saved to raw_generated_data/g3.csv
Generating dataset 4/100...
Dataset saved to raw_generated_data/g4.csv
Generating dataset 5/100...
Dataset saved to raw_generated_data/g5.csv
Generating dataset 6/100...
Dataset saved to raw_generated_data/g6.csv
Generating dataset 7/100...
Dataset saved to raw_generated_data/g7.csv
Generating dataset 8/100...
Dataset saved to raw_generated_data/g8.csv
Generating dataset 9/100...
Dataset saved to raw_generated_data/g9.csv
Generating dataset 10/100...
Dataset saved to raw_generated_data/g10.csv
Generating dataset 11/100...
Dataset saved to raw_generated_data/g11.csv
Generating dataset 12/100...
Dataset saved to raw_generated_data/g12.csv
Generating dataset 13/100...
Dataset saved to raw_generated_data/g13.csv
Ge
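
The log above suggests that the generator first clears the output folder and then writes one CSV file per dataset. A rough sketch of that behaviour follows. `generate_datasets` is a hypothetical stand-in, not the project's `genf_multiple_datasets`; the column names, row counts, and two-blob data are assumptions, and a temporary directory is used so that nothing real is overwritten.

```python
import csv
import os
import random
import tempfile


def generate_datasets(n, out_dir, seed=0):
    """Clear out_dir, then write n small synthetic CSV files g1.csv..gn.csv."""
    os.makedirs(out_dir, exist_ok=True)
    # Remove files left over from a previous run.
    for name in os.listdir(out_dir):
        path = os.path.join(out_dir, name)
        if os.path.isfile(path):
            os.remove(path)

    rng = random.Random(seed)
    paths = []
    for i in range(1, n + 1):
        n_rows = rng.randint(20, 50)   # varying dataset size
        noise = rng.uniform(0.1, 1.0)  # varying noise level
        path = os.path.join(out_dir, f"g{i}.csv")
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["x1", "x2", "label"])
            for _ in range(n_rows):
                label = rng.randint(0, 1)
                # Class 0 is centred at (0, 0), class 1 at (3, 3).
                writer.writerow(
                    [rng.gauss(3 * label, noise), rng.gauss(3 * label, noise), label]
                )
        paths.append(path)
    return paths


out_dir = os.path.join(tempfile.mkdtemp(), "raw_generated_data")
paths = generate_datasets(3, out_dir)
print(sorted(os.listdir(out_dir)))
```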

In [3]:
import os

directory = "raw_generated_data"
files = os.listdir(directory)

print(files)

['g100.csv', 'g28.csv', 'g14.csv', 'g15.csv', 'g29.csv', 'g17.csv', 'g16.csv', 'g12.csv', 'g13.csv', 'g11.csv', 'g39.csv', 'g38.csv', 'g10.csv', 'g88.csv', 'g63.csv', 'g77.csv', 'g8.csv', 'g9.csv', 'g76.csv', 'g62.csv', 'g89.csv', 'g48.csv', 'g74.csv', 'g60.csv', 'g61.csv', 'g75.csv', 'g49.csv', 'g71.csv', 'g65.csv', 'g59.csv', 'g58.csv', 'g64.csv', 'g70.csv', 'g99.csv', 'g66.csv', 'g72.csv', 'g73.csv', 'g67.csv', 'g98.csv', 'g81.csv', 'g95.csv', 'g42.csv', 'g56.csv', 'g1.csv', 'g57.csv', 'g43.csv', 'g94.csv', 'g80.csv', 'g96.csv', 'g82.csv', 'g69.csv', 'g55.csv', 'g41.csv', 'g2.csv', 'g3.csv', 'g40.csv', 'g54.csv', 'g68.csv', 'g83.csv', 'g97.csv', 'g93.csv', 'g87.csv', 'g50.csv', 'g44.csv', 'g78.csv', 'g7.csv', 'g6.csv', 'g79.csv', 'g45.csv', 'g51.csv', 'g86.csv', 'g92.csv', 'g84.csv', 'g90.csv', 'g47.csv', 'g53.csv', 'g4.csv', 'g5.csv', 'g52.csv', 'g46.csv', 'g91.csv', 'g85.csv', 'g21.csv', 'g35.csv', 'g34.csv', 'g20.csv', 'g36.csv', 'g22.csv', 'g23.csv', 'g37.csv', 'g33.csv', 'g27.c
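
`os.listdir` returns entries in arbitrary order, which is why the listing above is not sorted. If a stable numeric order is needed (for example, to process `g2.csv` before `g10.csv`), a sketch like the following could be used. `natural_key` is a hypothetical helper name, and the short file list is a hand-picked sample rather than the full directory.

```python
import re


def natural_key(name):
    """Split a name into text and integer chunks so 'g2' sorts before 'g10'."""
    return [int(part) if part.isdigit() else part for part in re.split(r"(\d+)", name)]


files = ["g100.csv", "g28.csv", "g14.csv", "g1.csv", "g2.csv"]
print(sorted(files, key=natural_key))
# → ['g1.csv', 'g2.csv', 'g14.csv', 'g28.csv', 'g100.csv']
```

A plain `sorted(files)` would compare the names as strings and place `g100.csv` before `g14.csv`.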