This is the repository for the first assignment of the 'Machine Learning I' course, which focuses on implementing changes in a machine learning algorithm.
Group: Enrico Pacheco Sanchez | Enzo Nascentes Grigório | Matheus Goncalves Guerra
In the initial phase of this project, we established a robust experimental baseline by implementing a manual k-Nearest Neighbors (kNN) classifier. Our focus was on analyzing the model's behavior when exposed to Group 1: Noise and Outliers. To ensure a deep understanding of the algorithm's mechanics, we avoided high-level library implementations and developed the Euclidean-distance and majority-voting logic ourselves, taking as a reference the implementation at https://github.com/rushter/MLAlgorithms/tree/master .
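The core of such a manual classifier can be sketched in a few lines. The function below is a minimal illustration (not the repository's actual code) of the two mechanics mentioned above: Euclidean distance to every training point, followed by a majority vote among the k nearest labels.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, X_test, k=3):
    """Predict labels for X_test via Euclidean distance and majority vote."""
    preds = []
    for x in X_test:
        # Euclidean distance from x to every training point
        dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
        # Indices of the k nearest neighbors
        nearest = np.argsort(dists)[:k]
        # Majority vote among the neighbors' labels
        preds.append(Counter(y_train[nearest]).most_common(1)[0][0])
    return np.array(preds)
```

Because every prediction scans the full training set, this brute-force form is O(n) per query, which is acceptable at the dataset sizes used in this study.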
Before running the benchmark, we developed an automated pipeline to handle diverse datasets:
- Label Encoding: Categorical features and target labels were converted into numerical formats to allow for distance calculations.
- Missing Values: Gaps in the data were handled through mean imputation to maintain dataset integrity.
- Min-Max Scaling: All features were normalized to the $[0, 1]$ range. This is critical for kNN, as it prevents features with larger numerical scales from disproportionately influencing the distance metric.
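The three preprocessing steps above can be sketched as small NumPy helpers. This is an illustrative outline under simple assumptions (NaN marks missing numeric values, categories are encoded in sorted order), not the project's exact pipeline.

```python
import numpy as np

def encode_labels(column):
    """Map each distinct category to an integer (simple label encoding)."""
    categories = {c: i for i, c in enumerate(sorted(set(column)))}
    return np.array([categories[c] for c in column], dtype=float)

def impute_mean(X):
    """Replace NaNs in each column with that column's mean."""
    X = X.copy()
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = np.take(col_means, cols)
    return X

def min_max_scale(X):
    """Scale every feature to the [0, 1] range."""
    mins, maxs = X.min(axis=0), X.max(axis=0)
    span = np.where(maxs > mins, maxs - mins, 1.0)  # avoid divide-by-zero on constant columns
    return (X - mins) / span
```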
We conducted an extensive empirical study across approximately 50 noisy datasets:
- 10-fold Cross-Validation: To ensure statistical stability, every dataset was split into 10 folds, training on 9 and testing on 1, repeating the process 10 times.
- Hyperparameter Tuning: We evaluated $k$ values of $1, 3, 5,$ and $11$ to observe the trade-off between local sensitivity and global smoothing.
- Performance Metrics: Results were recorded as the mean accuracy accompanied by the Standard Error of the Mean (SEM) to measure the consistency of the model across folds.
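The evaluation loop above can be sketched as follows. This is a hypothetical outline, not the benchmark script itself: `predict_fn` stands in for any kNN predictor with the signature `(X_train, y_train, X_test, k)`, and the fold split here is a simple shuffled partition.

```python
import numpy as np

def cross_validate(predict_fn, X, y, k_values=(1, 3, 5, 11), n_folds=10, seed=0):
    """Return {k: (mean accuracy, SEM)} over n_folds-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    results = {}
    for k in k_values:
        accs = []
        for i in range(n_folds):
            # Train on 9 folds, test on the remaining one
            test_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
            preds = predict_fn(X[train_idx], y[train_idx], X[test_idx], k)
            accs.append(np.mean(preds == y[test_idx]))
        accs = np.array(accs)
        # SEM = sample standard deviation / sqrt(number of folds)
        results[k] = (accs.mean(), accs.std(ddof=1) / np.sqrt(n_folds))
    return results
```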
To assess the statistical validity of our findings, we applied the Friedman Test. This non-parametric test confirmed that the differences in performance between the chosen
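In practice the Friedman Test can be run with `scipy.stats.friedmanchisquare`, passing one sample of per-fold accuracies per $k$ value. The accuracy lists below are made-up placeholders purely to show the call shape; in the real study they would come from the cross-validation results.

```python
from scipy.stats import friedmanchisquare

# Hypothetical per-fold accuracies for k = 1, 3, 5, 11 (10 folds each).
acc_k1  = [0.80, 0.78, 0.82, 0.79, 0.81, 0.77, 0.80, 0.83, 0.78, 0.80]
acc_k3  = [0.84, 0.83, 0.85, 0.82, 0.86, 0.83, 0.84, 0.85, 0.82, 0.84]
acc_k5  = [0.85, 0.84, 0.86, 0.83, 0.85, 0.84, 0.86, 0.85, 0.83, 0.85]
acc_k11 = [0.82, 0.81, 0.83, 0.80, 0.82, 0.81, 0.83, 0.82, 0.80, 0.82]

stat, p_value = friedmanchisquare(acc_k1, acc_k3, acc_k5, acc_k11)
print(f"Friedman statistic = {stat:.3f}, p-value = {p_value:.4f}")
# A small p-value (e.g. < 0.05) indicates that at least one k value
# performs significantly differently from the others across folds.
```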