# Project 3: Non-Parametric Methods and Unsupervised Learning for Parkinson's Detection Based on Vocal Speech Recordings

Project 3
- Enrique Almazán Sánchez
- Judith Briz Galera

# Introduction

In the culminating phase of our exploration into the realm of Machine Learning and its application in Parkinson's disease detection, Project 3 marks the integration of non-parametric methodologies and advanced embedding techniques. Having meticulously curated and preprocessed our dataset in Project 1, and delved into the world of parametric regression and classification in Project 2, we are now poised to unlock further potential using algorithms designed to capture intricate patterns and relationships.

Project 3 extends our journey by incorporating non-parametric methods such as K-Nearest Neighbors (KNN) and Decision Trees, alongside powerful ensemble methods like the Random Forest Classifier (RFC) and Gradient Boosting (GBT), and ending with the Multilayer Perceptron (MLP). These methodologies promise to unravel complex dependencies within the dataset, fostering a deeper understanding of the features contributing to Parkinson's disease.

Additionally, we venture into the realm of unsupervised techniques, employing K-Means clustering and Hierarchical Agglomerative Clustering to uncover hidden structures and natural groupings within the data. This holistic approach seeks to enhance both the accuracy and interpretability of our Parkinson's disease detection model, setting the stage for a comprehensive analysis that transcends traditional parametric boundaries.

As we embark on this final chapter, our aim is not only to refine the detection capabilities of our model but also to contribute valuable insights that may impact the diagnosis and management of Parkinson's disease. The synthesis of non-parametric and embedding techniques represents a crucial step forward in our pursuit of leveraging AI for the betterment of healthcare and the well-being of individuals affected by Parkinson's.

# Previous concerns

<div class="alert alert-block alert-danger">Reflecting on the preceding stages of our project journey, a notable concern that emerged from the implementation of parametric classification, using logistic regression, is the modest performance obtained. The results barely surpassed a random coin toss, with the best-performing models yielding accuracies in the range of 0.50 to 0.58. This raises a critical issue, as the effectiveness of the parametric approach, crucial in earlier phases, seems to plateau at suboptimal levels.

Such mediocre results underscore the need for a shift in our methodology, prompting the exploration of non-parametric methods and advanced embedding techniques. The primary objective now becomes the enhancement of classification accuracy and interpretability, with the aim of surpassing the limitations encountered during the parametric classification phase. The central concern for Project 3 is thus to raise the performance metrics to levels that clearly exceed chance, fostering a more reliable and clinically relevant Parkinson's disease detection model.
</div>

<div class = "alert alert-block alert-info">In our quest to address the suboptimal results observed in previous phases, distinct analysis procedures will be employed. Firstly, we will adhere to the established method used in previous projects, aiming to investigate whether the persistently poor outcomes can be attributed to the methodology itself. This comparative analysis will help pinpoint potential limitations inherent in the parametric classification method, such as logistic regression, which has been the cornerstone of our initial explorations.

Additionally, we will explore new perspectives by employing two different cross-validation approaches: Leave One Group Out (LOGO) and Summarized Leave One Out (SLOO). Each of these cross-validation methods offers a unique approach to assessing model performance. The implementation of these procedures will be carried out in separate Jupyter notebooks, ensuring a clear separation and documentation of each step. This meticulous approach will not only facilitate result comparisons between methods but also enable an evaluation of their suitability and effectiveness in the context of Parkinson's disease detection.</div>

# Objectives

The overarching goal of Project 3 is to advance the understanding and application of non-parametric methods and advanced embedding techniques in the context of Parkinson's disease detection. By integrating some of the methods previously mentioned, the project aims to enhance the accuracy and interpretability of classification tasks associated with Parkinson's disease diagnosis, as well as improve the results which we obtained using parametric methods.

## Non-Parametric Methods

In the exploration of non-parametric methods, we will delve into the intricacies of each algorithm, understanding their nuances, and addressing the proposed practice questions, such as the possible normalization of the features for each of them or the hyperparameters considered for their application.

### 1. k-Nearest Neighbors (KNN)

<div class = "alert alert-block alert-info">KNN is a simple, non-parametric algorithm used for classification and regression tasks. It classifies a data point based on the majority class of its k-nearest neighbors in the feature space. The value of k is a user-defined hyperparameter.</div>

- **Normalization of Features**: The performance of a k-NN scheme can be affected by the scale of the features. If the features are not normalized, those with larger scales may dominate the distance calculation. Normalizing variables (scaling them to be in the same range) ensures that each feature contributes proportionately to the distance metric.
    If the design and performance evaluation of a k-NN scheme are done with non-normalized variables, the performance may be suboptimal. Normalization is crucial to ensure fair and effective comparisons between features, leading to a more accurate representation of the underlying data relationships and improving the model's performance.
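
As a rough illustration of the point above, the following sketch compares a k-NN classifier with and without standardization. The synthetic dataset, the artificially inflated first feature and `n_neighbors=5` are assumptions made for the example, not the project's data or settings. The scaler sits inside a pipeline so that it is fitted on the training folds only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features (assumption for illustration).
X, y = make_classification(n_samples=200, n_features=6, n_informative=4, random_state=0)

# Inflate one feature so that it dominates the distance computation.
X_mixed_scale = X.copy()
X_mixed_scale[:, 0] *= 1000.0

raw_knn = KNeighborsClassifier(n_neighbors=5)
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

# 5-fold cross-validated accuracy for both variants.
raw_scores = cross_val_score(raw_knn, X_mixed_scale, y, cv=5)
scaled_scores = cross_val_score(scaled_knn, X_mixed_scale, y, cv=5)

print("without normalization:", raw_scores.mean())
print("with normalization:   ", scaled_scores.mean())
```

How large the gap is depends on the data; the sketch only shows how such a comparison could be set up.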
    
***Hyperparameter 'k'***

The hyperparameter k in kNN determines the number of nearest neighbors considered when making predictions for a new, unseen data point. Specifically, for a given test observation, the algorithm identifies the k training data points that are closest in feature space and aggregates their target values to make a prediction.

- **Estimating Target Value in k-NN**: The target value for a specific test observation is estimated by taking a majority vote (as we are in a classification task) of the target values of its k nearest neighbors in the feature space. In classification, the class with the most occurrences among the k neighbors is assigned to the test observation.

- **Finding the Most Suitable Value for k**: Selecting the most suitable value for k involves a trade-off. Smaller values of k result in predictions that are more sensitive to noise and outliers, potentially leading to overfitting. Larger values of k provide smoother and more robust predictions but may oversimplify the model. The optimal k depends on the specific characteristics of the dataset and the underlying data distribution. Common approaches include cross-validation to assess performance with different k values and choosing the value that minimizes the validation error.

- **Relationship between Generalization Capability and k**: The choice of k has a significant impact on the generalization capability of the kNN model. Smaller values of k tend to result in more flexible models that can capture intricate patterns in the training data but may lead to overfitting. Larger values of k generally promote smoother decision boundaries, enhancing generalization to new, unseen data, but might oversimplify the model. The optimal k for generalization is often found by balancing these trade-offs through experimentation and validation.
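
The search for a suitable k described above could be sketched as follows. The candidate values of k and the synthetic data are hypothetical choices for illustration; in the project the grid would be adapted to the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features (assumption for illustration).
X, y = make_classification(n_samples=200, n_features=6, n_informative=4, random_state=0)

k_values = [1, 3, 5, 11, 21, 41]  # hypothetical candidate grid
mean_accuracy = {}
for k in k_values:
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    # Mean validation accuracy over 5 folds for this value of k.
    mean_accuracy[k] = cross_val_score(model, X, y, cv=5).mean()

# Keep the k with the highest mean validation accuracy.
best_k = max(mean_accuracy, key=mean_accuracy.get)
print(mean_accuracy, "best k:", best_k)
```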


***Hyperparameter p***

The hyperparameter p sets the exponent of the Minkowski distance that the KNN algorithm uses to measure how far apart two data points are. The most common values for p are 1 and 2.

   - When p = 1: It corresponds to the Manhattan distance (L1 norm), where the distance between two points is the sum of the absolute differences of their coordinates.
   - When p = 2: It corresponds to the Euclidean distance (L2 norm), where the distance between two points is the straight-line distance between them. This is the default choice in scikit-learn's KNeighborsClassifier.
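
A small worked example may help to see the difference between the two values of p. The helper `minkowski_distance` and the two points are made up for illustration.

```python
import numpy as np

def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two points."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

point_a = [0.0, 0.0]
point_b = [3.0, 4.0]

manhattan = minkowski_distance(point_a, point_b, p=1)  # |3| + |4| = 7.0
euclidean = minkowski_distance(point_a, point_b, p=2)  # sqrt(9 + 16) = 5.0
print(manhattan, euclidean)
```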


***Hyperparameter algorithm***

The `algorithm` hyperparameter in KNN specifies the algorithm used to compute nearest neighbors. It influences the efficiency of the KNN algorithm.

- `'auto'`: Automatically selects the most appropriate algorithm based on the input data and other parameters. It is a good default choice.
- `'ball_tree'`: Uses a Ball Tree data structure to organize and search for neighbors efficiently.
- `'kd_tree'`: Uses a KD-Tree data structure for neighbor search.
- `'brute'`: Performs a brute-force search and computes distances for all data points. It can be inefficient for large datasets but is suitable for smaller datasets.


***Hyperparameter weights***

The `weights` hyperparameter controls how much influence each of these neighbors has on the classification of the new point.

- Uniform weights: Each neighbor has an equal vote in the decision-making process. Regardless of how far away a neighbor is, its contribution to the classification decision is the same.
- Distance weights: The contribution of each neighbor is weighted by its distance to the new data point. Closer neighbors have a stronger influence on the classification decision than neighbors that are farther away. This is typically implemented by assigning weights inversely proportional to the distance. The closer the neighbor, the higher its weight.
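
The two weighting schemes can lead to different predictions for the same neighbors. The sketch below uses three hypothetical neighbors (the distances and labels are invented) and a small helper, `vote`, written only for this example.

```python
import numpy as np

# Three hypothetical nearest neighbors of a new point.
distances = np.array([0.5, 2.0, 2.5])
labels = np.array([1, 0, 0])

def vote(labels, weights):
    """Return the class whose neighbors accumulate the largest total weight."""
    classes = np.unique(labels)
    totals = [weights[labels == c].sum() for c in classes]
    return int(classes[int(np.argmax(totals))])

# Uniform: class 0 gets 2 votes, class 1 gets 1 vote.
uniform_pred = vote(labels, np.ones_like(distances))

# Distance: weights are 1/d = [2.0, 0.5, 0.4], so class 1 (2.0) beats class 0 (0.9).
distance_pred = vote(labels, 1.0 / distances)

print(uniform_pred, distance_pred)  # 0 1
```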

### 2. Decision Tree Classifier (DTC)

<div class = "alert alert-block alert-info">Decision Trees are a tree-like model where each internal node represents a decision based on the value of a particular feature, and each leaf node represents the outcome (class label). Decision Tree Classifier is used for classification tasks.</div>

- **Normalization of Features**: Decision trees are generally not sensitive to the scale of features. Decision tree splits are based on feature values and their relative ordering rather than their absolute magnitudes. Therefore, normalizing features is often unnecessary when using decision trees. Normalization might be more critical for distance-based algorithms like k-NN, where the distance calculation is affected by the scale of features. However, for decision trees, the inherent structure of the tree is not affected by feature scales.


***Hyperparameter max_depth***

The `max_depth` hyperparameter controls the maximum depth of the decision tree. Controlling the maximum depth is crucial for preventing overfitting. Smaller values for `max_depth` can lead to simpler trees and potentially better generalization on unseen data.

- Integer:  If an integer is provided, it limits the depth of the tree to the specified value. Nodes beyond this depth are not expanded.
- `None`: If set to `None`, nodes are expanded until they contain fewer than `min_samples_split` samples or until all leaves are pure.

Also, a rule of thumb to take into account regarding this hyperparameter is:

- For small datasets, you may want to limit the depth of the tree to prevent overfitting.
- For large datasets, allowing greater depth can be useful for capturing more complex relationships.


***Hyperparameter minimum number of samples per split***

The `min_samples_split` hyperparameter sets the minimum number of samples an internal node must contain before it can be split further. Evaluating designs with different values for this hyperparameter is crucial for controlling the complexity and depth of the tree. The rule of thumb commonly applied is to avoid excessively deep trees, which might lead to overfitting.

1. **Small Values (e.g., 2 to 10):** These can capture fine details in the training data, potentially leading to more accurate predictions. However, they carry a higher risk of overfitting, especially if the dataset has noise or outliers.

2. **Moderate Values (e.g., 20 to 50):** These balance model complexity, reducing the risk of overfitting. However, they may miss some finer details in the data.

3. **Large Values (e.g., 100 or more):** These give simpler models that are less prone to overfitting. However, they might oversimplify the model, potentially leading to underfitting.

The choice of the minimum number of samples per split should be guided by a balance between capturing enough details from the training data and preventing overfitting. Smaller values allow the tree to be more flexible, while larger values promote simplicity. The specific value depends on the characteristics of the dataset. Cross-validation can be employed to assess the model's performance with different hyperparameter values and choose the one that provides the best balance between bias and variance.


***Hyperparameter minimum number of samples per leaf***

The `min_samples_leaf` hyperparameter controls the size of the leaf nodes. Specifically, it sets the minimum number of samples that must be present in a leaf node. If a candidate split would leave either child node with fewer samples than specified, the algorithm does not perform that split, and if no valid split remains, the current node becomes a leaf node.

Setting a higher value for min_samples_leaf can have several effects:

1. **Regularization**: Larger values of min_samples_leaf can help prevent overfitting by avoiding the creation of nodes that are too specific to the training data.

2. **Smoother Decision Boundaries**: Larger leaf nodes may lead to decision boundaries that are smoother and less prone to capturing noise in the training data.

3. **Improved Generalization**: A tree with larger leaf nodes may generalize better to unseen data because it focuses on broader patterns in the data.

Also, for the last two explained hyperparameters the following rule of thumb is highlighted, and will be taken into account:

- For small datasets, consider increasing the values of `min_samples_split` and `min_samples_leaf` to avoid splits and leaves that are too small, which may lead to overfitting. The values should not be too large either, as that can stop the tree from dividing too soon.
- For large datasets, you can decrease these values to allow for more splits and smaller leaves, which can help the model capture finer patterns. The values should not be too small either, as the tree can then overfit and will not be a good predictor for unseen data.
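
One possible way to evaluate these depth and sample-size hyperparameters together is a cross-validated grid search. The grid values and the synthetic data below are assumptions chosen for illustration, not the values used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real features (assumption for illustration).
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)

# Hypothetical grid over the three hyperparameters discussed above.
param_grid = {
    "max_depth": [2, 4, 8, None],
    "min_samples_split": [2, 10, 30],
    "min_samples_leaf": [1, 5, 15],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```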


***Hyperparameter criterion***

The `criterion` hyperparameter in a Decision Tree Classifier determines the criterion used for measuring the quality of a split at each node.

- `'gini'`: Uses the Gini impurity as the criterion. It measures how often a randomly chosen element would be incorrectly classified. Gini impurity is suitable for classification problems and is the default criterion in scikit-learn's DecisionTreeClassifier.
- `'entropy'`: Uses information gain as the criterion. It measures the reduction in entropy (disorder) achieved by a split. Entropy comes from information theory and is another common criterion for Decision Trees.
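
Both impurity measures can be computed by hand for a node. The sketch below uses the usual definitions (Gini: one minus the sum of squared class proportions; entropy: the sum of p times log2(1/p)). The helper names and the toy label lists are made up for the example.

```python
import numpy as np

def class_proportions(labels):
    """Fraction of samples belonging to each class present in the node."""
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def gini_impurity(labels):
    p = class_proportions(labels)
    return float(1.0 - np.sum(p ** 2))

def entropy(labels):
    p = class_proportions(labels)
    return float(np.sum(p * np.log2(1.0 / p)))

balanced_node = [0, 0, 1, 1]  # maximally mixed for two classes
pure_node = [1, 1, 1, 1]      # a single class only

print(gini_impurity(balanced_node), entropy(balanced_node))  # 0.5 1.0
print(gini_impurity(pure_node), entropy(pure_node))          # 0.0 0.0
```

Both measures are zero for a pure node and largest for an evenly mixed one, which is why either can be used to rank candidate splits.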


***Hyperparameter splitter***

The `splitter` hyperparameter determines the strategy used to choose the split at each node.

- `'best'`: Chooses the best split based on the selected criterion. It evaluates all possible splits and selects the one that maximizes information gain or minimizes impurity.
- `'random'`: Chooses a random split. It can be useful for preventing overfitting by introducing randomness into the tree-building process.

### 3. Random Forest Classifier (RFC)

<div class = "alert alert-block alert-info">Random Forest is an ensemble learning method that constructs a multitude of decision trees during training and outputs the mode of the classes (classification) of the individual trees. It helps to improve the accuracy and control over-fitting.</div>

- **Normalization of Features**: Random Forests build decision trees, and these trees make decisions based on feature thresholds. The order of magnitude or scale of features does not impact the decision-making process in tree-based models. Therefore, normalization is not strictly necessary for RFC. However, as our idea is to compare its performance with models that are sensitive to feature scales (such as K-NN or MLP), using the same normalized features keeps the comparison fair.

It is worth mentioning that the RFC shares various hyperparameters with the DTC, as expected, such as:

- ***max_depth***
- ***min_samples_split***
- ***criterion***

which have already been explained. However, the following will also be considered:


***Hyperparameter n_estimators***

The n_estimators hyperparameter represents the number of trees to be created in the forest. Each tree in the forest gets a vote when making a prediction, and the final prediction is determined by a majority vote (for classification) or an average (for regression) among the trees.

A higher value of n_estimators generally leads to a more robust and stable model, as it reduces the impact of individual noisy trees. However, there is a diminishing return, and at some point, the improvement in performance may plateau or even decrease, while the computational cost increases.
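
A simple way to look for the plateau mentioned above is to cross-validate forests of increasing size. The tree counts and the synthetic data are hypothetical choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real features (assumption for illustration).
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)

forest_scores = {}
for n_trees in [5, 25, 100]:  # hypothetical forest sizes
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    # Mean 5-fold accuracy for a forest with this many trees.
    forest_scores[n_trees] = cross_val_score(forest, X, y, cv=5).mean()

print(forest_scores)
```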


### 4. Gradient Boosting Classifier (GBT)

<div class = "alert alert-block alert-info">Gradient Boosting is an ensemble learning technique that builds a series of weak learners (usually decision trees) sequentially. Each tree corrects the errors of the previous one, improving the overall predictive performance. It's often used for classification tasks.</div>

- **Normalization of Features**: Gradient Boosting builds its ensemble from decision trees as weak learners, and tree splits depend on the ordering of feature values rather than on their magnitude. Like DTC and RFC, it is therefore largely insensitive to feature scales, and normalization is not expected to change its predictions. Using the same normalized features as the scale-sensitive models does no harm and keeps the comparison between methods consistent.


Again, as is logical, some hyperparameters from the DTC and RFC are used, such as:

- ***n_estimators***
- ***max_depth***

using also:


***Hyperparameter learning_rate***

The learning_rate hyperparameter controls the contribution of each weak learner to the ensemble. It scales the contribution of each tree before adding it to the ensemble. Lower values result in a slower learning process but may improve model generalization.
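
The role of the learning rate can be summarized with the standard additive update of gradient boosting. The symbols are notation introduced here for illustration: \(F_m\) is the ensemble after \(m\) trees, \(h_m\) is the \(m\)-th tree (fitted to the errors of \(F_{m-1}\)), and \(\nu\) is the `learning_rate`.

\[
F_m(x) = F_{m-1}(x) + \nu \, h_m(x), \qquad 0 < \nu \le 1
\]

A smaller \(\nu\) shrinks the contribution of every tree, so more trees (`n_estimators`) are usually needed to reach the same fit on the training data.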

### 5. Multi-Layer Perceptron (MLP)

<div class = "alert alert-block alert-info">MLP is a type of artificial neural network with multiple layers of nodes (neurons) in a feedforward fashion. It consists of an input layer, one or more hidden layers, and an output layer. MLP is used for a variety of tasks, including classification, regression, and pattern recognition.</div>

- **Normalization of Features**: Normalizing features is generally recommended for MLPs. The activation functions and weight updates in MLPs are sensitive to the scale of input features. Normalization ensures that all features contribute more equally to the learning process and aids in faster convergence. Thus, without normalization, training may be slower, and the model may struggle to converge. The lack of feature normalization might lead to uneven contributions from features, affecting the model's ability to learn effectively.

***The MLP is a Universal Approximator***: This statement refers to the theoretical capacity of a multi-layer perceptron with a sufficient number of neurons to approximate any continuous function on a bounded input region to arbitrary accuracy. While a single hidden layer with one neuron can represent some simple functions, the universality property becomes more apparent and powerful as the network's capacity (number of neurons and layers) increases.

- For a single hidden layer with one neuron, evaluating the model's performance using standard metrics like accuracy, precision, recall, and F1 score would be informative. Additionally, examining learning curves, confusion matrices, and ROC curves can provide insights into the model's behavior.


***Hyperparameter: Hidden Layer Sizes***:

Regarding the underfitting and overfitting considerations with a single Hidden Layer with One Neuron in MLP:

   - **Likelihood of Overfitting:** With a single hidden layer and only one neuron, the model's capacity is severely limited. It may struggle to capture the complexity of the underlying relationships in the data. Overfitting is less likely because the model lacks the flexibility to fit the training data too closely.

   - **Likelihood of Underfitting:** Underfitting is more likely due to the simplicity of the model. The single neuron in the hidden layer may struggle to learn complex patterns in the data, resulting in a model that fails to adequately represent the task.
   
A single hidden layer with one neuron is generally not recommended for complex tasks. However, it might be reasonable for extremely simple tasks or initial exploration. For more complex tasks, increasing the number of neurons and layers is typically necessary to allow the model to learn intricate patterns.

In summary, a single hidden layer with one neuron in an MLP is likely to result in underfitting due to its limited capacity. While it may be reasonable for very simple tasks, expanding the architecture is typically necessary for more complex problems. Normalizing features is advisable to improve the training process and the model's ability to learn from the data.

        
***Hyperparameter activation***:

The `activation` hyperparameter in MLP determines the activation function used in the hidden-layer neurons of the neural network.

 - `'identity'`: Linear activation function ( \(f(x) = x\) ). With it the hidden layers add no non-linearity, so the network output remains a linear combination of the input features.
 - `'logistic'`: Logistic sigmoid activation function ( \(f(x) = \frac{1}{1 + e^{-x}}\) ). It squashes its output between 0 and 1, the same range as a probability.
 - `'tanh'`: Hyperbolic tangent activation function ( \(f(x) = \tanh(x)\) ). It squashes the output between -1 and 1 and is often used for hidden layers in neural networks.
 - `'relu'`: Rectified Linear Unit activation function ( \(f(x) = \max(0, x)\) ). It is widely used in hidden layers for introducing non-linearity and is the default in scikit-learn's MLPClassifier.
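
The four functions listed above are easy to evaluate directly. The sketch below applies them to a few sample inputs chosen arbitrarily for illustration.

```python
import numpy as np

def identity(x):
    return x

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

x = np.array([-2.0, 0.0, 2.0])  # arbitrary sample inputs

activations = {
    "identity": identity(x),
    "logistic": logistic(x),  # values in (0, 1), equal to 0.5 at x = 0
    "tanh": np.tanh(x),       # values in (-1, 1), equal to 0 at x = 0
    "relu": relu(x),          # negative inputs are clipped to 0
}

for name, values in activations.items():
    print(name, values)
```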


***Hyperparameter alpha***:

The `alpha` hyperparameter in MLP represents the strength of the L2 regularization term applied to the weights of the neural network. Higher values of `alpha` result in stronger regularization, penalizing large weights in the network. Regularization helps avoid overfitting and so improves the generalization performance of the model on unseen data, especially when the model has a large number of parameters.
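
A minimal sketch of how different regularization strengths could be compared is shown below. The hidden layer size, the `alpha` values, `max_iter` and the synthetic data are all assumptions for illustration. Features are standardized inside the pipeline, in line with the normalization advice for MLPs.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features (assumption for illustration).
X, y = make_classification(n_samples=200, n_features=6, n_informative=4, random_state=0)

alpha_scores = {}
for alpha in [1e-4, 1e-2, 1.0]:  # hypothetical regularization strengths
    mlp = MLPClassifier(
        hidden_layer_sizes=(16,),  # one hidden layer with 16 neurons
        activation="relu",
        alpha=alpha,
        max_iter=2000,
        random_state=0,
    )
    model = make_pipeline(StandardScaler(), mlp)
    # Mean 3-fold accuracy for this value of alpha.
    alpha_scores[alpha] = cross_val_score(model, X, y, cv=3).mean()

print(alpha_scores)
```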

## Unsupervised Learning

K-Means clustering will be employed with different values of k, justified by a chosen performance metric, and potential preprocessing considerations will be discussed. 

Hierarchical agglomerative clustering will explore multiple linkage options, with subsequent analysis of dendrograms to determine the most suitable linkage type and the number of groups.

The comparison between partitional and hierarchical clustering will be conducted, and the obtained patterns will be interpreted to further contribute to the comprehensive understanding of the dataset. These methodologies collectively contribute to our overarching goal of advancing Parkinson's disease detection through the lens of non-parametric methods and unsupervised learning.


### 1. K-Means

K-means is a clustering algorithm that partitions a dataset into K distinct clusters. Each data point belongs to the cluster with the nearest mean, and the mean serves as a centroid of the cluster. The algorithm proceeds through iterative steps, adjusting cluster assignments and centroids to minimize the sum of squared distances between data points and their respective cluster centroids.

**Normalization of features:** K-means is very sensitive to the scale of the features. Features with larger scales can dominate the calculation of distances, leading to biased results. Standardizing features ensures that all features contribute equally to the clustering process.



### 2. Hierarchical clustering

Agglomerative Hierarchical Clustering is a hierarchical clustering algorithm that organizes similar objects into clusters in a tree-like structure. The process begins by treating each data point as an individual cluster. Through successive iterations, the algorithm merges the closest clusters according to a specified linkage criterion until all data points are part of a unified cluster.

**Normalization of features:** Agglomerative clustering merges clusters according to pairwise distances, so, as with K-means, features with larger scales can dominate the result. We will therefore use the standardized data here as well, which also facilitates the visualization steps.

## Proposed metrics

### For classification tasks through non-parametric methods:

1. **Accuracy:** ratio of correctly predicted instances to the total instances. It provides an overall performance measure, suitable when classes are balanced, where all classes have a similar representation in the data. However, it can be misleading in problems with imbalanced classes, where high accuracy may be inflated by good performance in the majority class.


2. **Sensitivity (Recall):** proportion of actual positive instances that the model correctly identifies, i.e., TP / (TP + FN). It is especially useful when minimizing false negatives is desired, i.e., positive instances wrongly classified as negative (as in medical diagnoses). For example, in a disease detection problem, sensitivity indicates the proportion of patients correctly classified as sick out of all patients who are actually sick.


3. **Precision:** ratio of correctly predicted positive observations to the total predicted positives, i.e., TP / (TP + FP). It is useful for problems where minimizing false positives is desired, i.e., negative instances wrongly classified as positive (as in spam detection). For example, in a spam detection system, precision indicates the proportion of emails flagged as spam that are actually spam.


4. **F1 Score:** harmonic mean of precision and recall, providing a balance between the two metrics. It is especially useful when seeking a balance between minimizing false positives and false negatives. The F1 score is commonly used in classification problems with imbalanced classes.


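5. **AUC-ROC (Area Under the ROC Curve):** the ROC curve is a graphical representation of the trade-off between sensitivity and specificity across classification thresholds, and AUC-ROC measures the area under this curve. It is useful for assessing the model's ability to discriminate between positive and negative instances: a value of 0.5 corresponds to random guessing and 1.0 to perfect separation.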
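
To make the threshold-based metrics concrete, the sketch below derives them from the four confusion-matrix counts. The counts are invented for the example and are not results from this project; `classification_metrics` is a helper written only for illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the threshold-based metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    sensitivity = tp / (tp + fn)  # recall: share of actual positives found
    specificity = tn / (tn + fp)  # share of actual negatives found
    precision = tp / (tp + fp)    # share of predicted positives that are correct
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }

# Hypothetical counts: 60 actual positives, 40 actual negatives.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts, accuracy is 0.700, sensitivity 0.667, specificity 0.750, precision 0.800 and F1 0.727.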


In addition, the following will be presented as tools to evaluate the performance of each model.

- **Confusion Matrix:** A table used to evaluate the performance of a classification algorithm, showing the counts of true positive, true negative, false positive, and false negative. Provides a detailed breakdown of the model's performance.


- **Probability Density Graph:** A graph that represents the probability distribution of predicted probabilities for each class. Helps to visualize the certainty of the model's predictions and understand the distribution of confidence.


### For unsupervised learning tasks:

1. **'Elbow' method:** plot of the sum of squared distances against k, used in K-means to find the optimal number of clusters between a given range of values of k.

2. **Silhouette:** applied with the K-means implementation to evaluate the appropriateness of the chosen number of clusters. Its value ranges from -1 to 1; the higher it is, the more appropriately the data point is assigned to its cluster. A score of 0 indicates that the data point is on the decision boundary between two neighboring clusters.

3. **Cophenetic correlation coefficient:** Used in hierarchical clustering to find the best linkage criterion to be used. A high cophenetic correlation coefficient indicates that the hierarchical clustering has effectively captured the underlying pairwise relationships between data points.
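
The three metrics above could be computed roughly as follows. The blob data, the range of k and the list of linkage methods are assumptions made for the example; plotting is omitted so that only the quantities behind the elbow plot, the silhouette analysis and the linkage comparison are shown.

```python
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features (assumption for illustration).
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Elbow and silhouette: one K-means fit per candidate k.
inertia, silhouette = {}, {}
for k in range(2, 7):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    inertia[k] = kmeans.inertia_  # sum of squared distances to centroids
    silhouette[k] = silhouette_score(X_scaled, kmeans.labels_)

# Cophenetic correlation: one hierarchy per linkage criterion.
pairwise_distances = pdist(X_scaled)
cophenetic = {}
for method in ["single", "complete", "average", "ward"]:
    hierarchy = linkage(X_scaled, method=method)
    coefficient, _ = cophenet(hierarchy, pairwise_distances)
    cophenetic[method] = coefficient

print(inertia)
print(silhouette)
print(cophenetic)
```

Plotting `inertia` against k gives the elbow curve, while the k with the highest silhouette and the linkage with the highest cophenetic coefficient are natural candidates to examine further.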

## Dependencies (Required Libraries)

In the following cell, we import all the necessary libraries for the project.

- **Pandas**: A Python library used for data analysis and manipulation. It provides flexible and efficient data structures, such as DataFrames, for working with tabular datasets. Pandas offers a wide range of functions and methods for cleaning, transforming, and exploring data, making the data preparation process easier before applying machine learning algorithms.

- **NumPy**: A fundamental library for scientific computing in Python. It provides a data structure called a multidimensional array (ndarray) that allows for efficient operations on data arrays. NumPy is widely used in numerical analysis and data processing, providing functionality for mathematical operations, array manipulation, and statistical calculations.

- **scipy.stats**: Python library within SciPy that focuses on statistical functions and probability distributions. It offers tools for working with probability distributions, statistical tests, random variables, descriptive statistics, and modeling. It's a versatile library used for statistical analysis and hypothesis testing in scientific research and data analysis.

- **Matplotlib**: A data visualization library in Python. It provides a wide range of functions and methods for creating static plots, such as line plots, bar charts, scatter plots, and contour plots. Matplotlib is highly customizable and allows for adding labels, titles, legends, and other annotations to plots. It is a popular tool for data visualization in data analysis and result presentation.

- **Plotly**: An interactive data visualization library for Python. It enables the creation of interactive and dynamic charts, including scatter plots, line charts, bar charts, and surface plots. Plotly offers a web-based user interface for exploring and manipulating charts, making it easy to create interactive visualizations and present data.

- **Seaborn**: A data visualization library based on Matplotlib. It provides a high-level interface for creating attractive and concise statistical plots. Seaborn simplifies the creation of distribution plots, regression plots, correlation plots, and other common chart types in data analysis. Additionally, Seaborn offers predefined styles and color palettes that enhance the appearance of charts.

- **Scikit-learn (sklearn)**: An open-source machine learning library for Python. It provides a wide range of algorithms and tools for performing machine learning tasks, such as classification, regression, clustering, and feature selection. Scikit-learn stands out for its ease of use and focus on efficiency and scalability. In addition to algorithms, the library also offers utilities for model evaluation, cross-validation, and data preprocessing.

# Conclusions and Discussion

## Non-parametric methods

The results, unfortunately, fall short of any meaningful clinical application. In some instances, the models perform slightly better than a coin toss, emphasizing the inadequacy of the methodologies employed. The phrase "garbage in, garbage out" resonates, suggesting that the dataset itself may be a limiting factor. It is imperative to explore more robust methods, such as deep neural networks, to discern whether the issues lie in the methodologies or if the data itself is inherently unsuitable for the intended task. These lackluster results call for a fundamental reevaluation of the modeling approach to ensure that any predictive model for Parkinson's disease detection is both reliable and clinically relevant.

### 1st approach: following previous projects

**K-Nearest Neighbors (KNN):**
KNN yields worse-than-chance results, with accuracy below 50%. Although sensitivity is relatively high, the lack of specificity shows that the model has little practical utility, since it predicts everything as positive. These results are far from satisfactory.
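
To make the link between "predicts everything as positive" and the reported metrics concrete, here is a minimal sketch on made-up labels (not the project's data). It assumes `1` denotes Parkinson's and `0` denotes healthy, and that negatives are the majority class, which is one way an all-positive classifier ends up below 50% accuracy.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels (assumption): 1 = Parkinson's, 0 = healthy
y_true = np.array([1, 0, 1, 0, 0, 1, 0, 1, 0, 0])
# A degenerate classifier that labels every recording as positive
y_pred = np.ones_like(y_true)

# With labels=[0, 1], ravel() returns tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

sensitivity = tp / (tp + fn)  # share of true positives that are detected
specificity = tn / (tn + fp)  # share of true negatives that are detected
accuracy = (tp + tn) / len(y_true)

print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}, accuracy={accuracy:.2f}")
```

In this toy case sensitivity is perfect while specificity is zero, so the high sensitivity carries no diagnostic information on its own.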

**Decision Tree Classifier (DTC):**
DTC's perfect accuracy on the training set is misleading: the model struggles to generalize and performs even worse than a coin toss on unseen data. The overfitting observed raises serious concerns about the model's robustness.
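
The gap between training and test accuracy can be reproduced on purely synthetic data. The sketch below is an illustration, not the project's pipeline: the labels are drawn independently of the features, so there is nothing to learn, yet an unconstrained tree still memorizes the training set. The `max_depth=2` tree is included only for contrast.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))     # synthetic features (assumption)
y = rng.integers(0, 2, size=300)   # labels independent of the features

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Unconstrained tree: grows until every training leaf is pure
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# Depth-limited tree: cannot memorize individual samples
shallow = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)

deep_train, deep_test = deep.score(X_train, y_train), deep.score(X_test, y_test)
shallow_train, shallow_test = shallow.score(X_train, y_train), shallow.score(X_test, y_test)

print(f"deep tree:    train={deep_train:.2f}, test={deep_test:.2f}")
print(f"shallow tree: train={shallow_train:.2f}, test={shallow_test:.2f}")
```

A perfect training score is therefore not evidence of signal in the data; only the held-out score is informative.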

**Random Forest Classifier (RFC):**
Similar to DTC, RFC exhibits overfitting tendencies, with high training accuracy but a substantial drop in performance on the test set. The inconsistency and variability in results further emphasize the model's lack of reliability.

**Gradient Boosting Classifier (GBT):**
GBT's performance, while slightly above 50%, is far from robust. The marginal improvement over a coin toss does little to instill confidence in its predictive capabilities for Parkinson's disease detection.

**Multilayer Perceptron (MLP):**
MLP's accuracy of around 50% indicates near-random predictive behavior. The model's inability to discern patterns within the data is evident, rendering it unsuitable for practical diagnostic applications. Its problem mirrors that of KNN: it predicts everything as negative and therefore has high specificity but low sensitivity.

**Voting Classifier (VCC - Ensemble Method):**
Despite high accuracy on the training set, the ensemble method struggles to generalize, with performance barely surpassing 50% on the test set. The supposed strength of ensemble learning is not evident in these lackluster results.
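
For reference, a voting ensemble of this kind can be sketched with scikit-learn's `VotingClassifier`. The base estimators, the hard-voting rule, and the noisy synthetic data below are all assumptions made for illustration; they are not necessarily the configuration used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic, deliberately noisy data (flip_y randomizes 30% of the labels)
X, y = make_classification(n_samples=200, n_features=10, flip_y=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Hard voting: each base model casts one vote and the majority class wins
ensemble = VotingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("dtc", DecisionTreeClassifier(random_state=0)),
        ("rfc", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("gbt", GradientBoostingClassifier(random_state=0)),
    ],
    voting="hard",
)
ensemble.fit(X_train, y_train)

train_acc = ensemble.score(X_train, y_train)
test_acc = ensemble.score(X_test, y_test)
print(f"train accuracy={train_acc:.2f}, test accuracy={test_acc:.2f}")
```

Voting tends to help when the base models make different mistakes; if they all overfit in a similar way, the ensemble is likely to inherit the same weakness.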

![image.png](attachment:image.png)


### 2nd approach: LOGO

The LOGO cross-validation approach provides a stark assessment of the models' performance. While average accuracies are similar to those obtained in the 1st approach, the significant variability in results across different test folds (also visible in the graphs presented in the respective document) is alarming. The lack of consistency suggests that the models struggle to maintain predictive power across diverse subsets of the dataset, and that their performance can be heavily influenced by the specific composition of the test set. The inability to consistently outperform random chance points to fundamental issues in the models' ability to generalize, which further emphasizes their unreliability.
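
A minimal sketch of how per-fold variability under LOGO can be measured with scikit-learn's `LeaveOneGroupOut` follows. The data is synthetic, and the grouping by a hypothetical subject ID (20 subjects with 6 recordings each, one label per subject) is an assumption about how the groups are defined.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_subjects, n_recordings, n_features = 20, 6, 10

# Hypothetical layout: one group ID per subject, one label per subject
groups = np.repeat(np.arange(n_subjects), n_recordings)
y = groups % 2
X = rng.normal(size=(groups.size, n_features)) + 0.5 * y[:, None]

# Scaling lives inside the pipeline so it is refit on each training fold
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

# Each fold holds out every recording of exactly one subject
scores = cross_val_score(
    model, X, y, groups=groups, cv=LeaveOneGroupOut(), scoring="accuracy"
)

print(f"folds: {len(scores)}")
print(f"mean accuracy: {scores.mean():.3f}, std across folds: {scores.std():.3f}")
```

Reporting the standard deviation next to the mean makes the fold-to-fold instability explicit instead of hiding it behind an average.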


### 3rd approach: s-LOO

In the case of s-LOO, the picture remains grim. Again, the substantial variability, which in fact increased (because our observations were summarized), indicates that the models' predictions are highly sensitive to the exclusion of individual samples. The inconsistency across different combinations of samples further highlights the lack of stability in the models. The results from s-LOO underscore the challenges in achieving reliable and reproducible predictions, and point towards a fundamental inadequacy in the models' ability to capture meaningful patterns within the data.
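
The sketch below shows one possible reading of s-LOO, stated as an assumption: each subject's recordings are collapsed into a single summary row (here the per-feature mean) and Leave-One-Out is then run over subjects. The column names and the synthetic data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
n_subjects, n_recordings, n_features = 20, 6, 5

# Hypothetical layout: several recordings per subject, one label per subject
subject_id = np.repeat(np.arange(n_subjects), n_recordings)
label = subject_id % 2
features = rng.normal(size=(subject_id.size, n_features)) + 0.5 * label[:, None]

df = pd.DataFrame(features, columns=[f"f{i}" for i in range(n_features)])
df["subject_id"] = subject_id
df["label"] = label

# Summarize: one row per subject (mean of each feature; an assumption)
summary = df.groupby("subject_id").mean()
X_s = summary.drop(columns="label").to_numpy()
y_s = summary["label"].astype(int).to_numpy()

# Leave-One-Out: every fold tests on a single summarized subject
scores = cross_val_score(
    KNeighborsClassifier(n_neighbors=3), X_s, y_s, cv=LeaveOneOut()
)

print(f"folds: {len(scores)}, mean accuracy: {scores.mean():.3f}")
print(f"distinct fold scores: {sorted(set(scores))}")
```

Because every test fold contains one sample, each fold's accuracy can only be 0 or 1, which by itself inflates the fold-to-fold variance compared with LOGO.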


Thus, both LOGO and s-LOO reveal the models' limitations in maintaining predictive performance across various test scenarios. The significant variability in results suggests that the models struggle with robust generalization, raising doubts about their reliability in real-world applications. These findings strengthen the case for a more rigorous exploration of modeling techniques, potentially including more sophisticated approaches like deep neural networks, to discern whether the observed issues stem from the methodologies employed or the inherent complexities of the dataset.

![image-2.png](attachment:image-2.png)


## Unsupervised Learning

After applying two types of clustering algorithms to our dataset, we reached almost identical results with both. Deciding which algorithm is better is difficult, since neither worked well enough. Despite this, we can highlight some advantages and disadvantages that we observed in each method.

First of all, it is worth mentioning that applying dimensionality reduction during preprocessing led to a different optimal number of clusters for hierarchical clustering than for K-means.

In K-means, dimensionality reduction provided better results, although still not good enough, and the number of clusters did not change when dimensionality reduction was applied.

In hierarchical clustering, dimensionality reduction also improved the silhouette coefficient, but this time the optimal number of clusters changed to 4. This is the only clustering result that differed.

From this, we can conclude that dimensionality reduction helps to produce clearer cluster assignments in both K-means and hierarchical clustering.
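
A comparison of this kind can be sketched as follows. This is an illustration on synthetic blobs, not the project's data; the use of PCA with two components, the range of candidate cluster counts, and the helper `best_k` are assumptions introduced for the example.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic high-dimensional data (assumption, not the voice features)
X, _ = make_blobs(n_samples=300, n_features=20, centers=3, random_state=0)
X_scaled = StandardScaler().fit_transform(X)
X_reduced = PCA(n_components=2, random_state=0).fit_transform(X_scaled)


def best_k(make_model, data, k_values=range(2, 7)):
    """Hypothetical helper: return (k, score) with the highest silhouette."""
    scores = {}
    for k in k_values:
        labels = make_model(k).fit_predict(data)
        scores[k] = silhouette_score(data, labels)
    k_best = max(scores, key=scores.get)
    return k_best, scores[k_best]


models = {
    "K-means": lambda k: KMeans(n_clusters=k, n_init=10, random_state=0),
    "Agglomerative": lambda k: AgglomerativeClustering(n_clusters=k),
}

results = {}
for name, make_model in models.items():
    for tag, data in (("full", X_scaled), ("PCA", X_reduced)):
        results[(name, tag)] = best_k(make_model, data)
        k, score = results[(name, tag)]
        print(f"{name:13s} {tag:4s} best k={k}, silhouette={score:.3f}")
```

Note that the silhouette is computed in whichever space the clustering was run, so scores on reduced and full data are not measured on exactly the same geometry and should be compared with some caution.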

Turning to the comparison of the algorithms, we reached the following conclusions for each method:

Regarding K-means, its linear time complexity makes it well-suited for handling large datasets like ours (which also has a high number of dimensions). Its simplicity and ease of implementation, along with the straightforward partitioning of data into clusters, contribute to its efficiency. However, K-means assumes clusters with equal variance and similar sizes, an assumption that does not suit the complex structures in our data. We would choose this algorithm because its results with and without dimensionality reduction are more consistent, and the score obtained with dimensionality reduction was the highest, although it still did not reach 0.5.

On the other hand, Agglomerative Hierarchical Clustering provides a more flexible framework for exploring hierarchical relationships within the data. It does not require specifying the number of clusters beforehand and lets each data point be examined at multiple levels of the hierarchy. This flexibility is useful given the complex structures in our high-dimensional dataset. The hierarchy also helps to interpret different levels of granularity, providing a more nuanced understanding of the data structure. However, it is computationally more expensive, which is a drawback for such a large dataset.
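
The idea of reading one hierarchy at several levels of granularity can be sketched with SciPy: the merge tree is built once and then cut into different numbers of clusters without refitting. The synthetic data, the Ward linkage, and the chosen cut levels are assumptions made for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

# Small synthetic dataset (assumption, not the project's data)
X, _ = make_blobs(n_samples=60, n_features=5, centers=4, random_state=0)

# Build the full merge tree once; no number of clusters is needed here
Z = linkage(X, method="ward")

# Cut the same tree at several granularities
cuts = {k: fcluster(Z, t=k, criterion="maxclust") for k in (2, 3, 4)}

for k, labels in cuts.items():
    sizes = np.bincount(labels)[1:]  # fcluster labels start at 1
    print(f"cut into at most {k} clusters -> sizes {sizes.tolist()}")
```

With K-means, each of these cluster counts would require a separate fit; here they are all views of one tree, which is what makes the dendrogram convenient for exploration.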

Unfortunately, the results obtained are not accurate enough to reach a solid conclusion. It would be easier if the structure of our data were more clearly defined. By this we mean that our dataset has too many dimensions and the data points may overlap, making it difficult to separate them into clear clusters with hard assignments. If applied to a different dataset, we might be able to reach clearer and better-justified conclusions as to which method is better.