### Imports - Libraries


In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import make_pipeline
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score 

### Load data from the auxiliary files

In [42]:
# Load the dataset
# Assuming the dataset is in CSV format and contains the features and the target
df = pd.read_csv('heart-disease.csv')

# Separate features and target
X = df.drop('target', axis=1) 
y = df['target']

In [43]:
# Set up stratified 5-fold cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Define classifiers
knn = KNeighborsClassifier(n_neighbors=5)
nb = GaussianNB()

### Exercise 1a)

In [None]:
# Perform cross-validation
knn_accuracies = cross_val_score(knn, X, y, cv=cv, scoring='accuracy')
nb_accuracies = cross_val_score(nb, X, y, cv=cv, scoring='accuracy')

# Boxplots
plt.figure(figsize=(8,6))
sns.boxplot(data=[knn_accuracies, nb_accuracies], palette="Set2")
plt.xticks([0, 1], ['KNN (k=5)', 'Naive Bayes'])
plt.title('KNN vs Naive Bayes Cross-Validation Accuracies')
plt.ylabel('Accuracy')
plt.show()

# Print fold accuracies
print(f"KNN accuracies: {knn_accuracies}")
print(f"Naive Bayes accuracies: {nb_accuracies}")


Gaussian Naive Bayes (GNB) achieves a higher average accuracy than k-Nearest Neighbors (kNN) with k = 5.

GNB is also more stable: its interquartile range is much narrower than kNN's, indicating less variability in accuracy across folds.

kNN, by contrast, has a wider box and longer whiskers, so its performance fluctuates more from fold to fold.

In this scenario GNB is therefore both more accurate and more stable than kNN. One plausible explanation is that GNB only estimates a per-class mean and variance for each feature, and these estimates change little when the training fold changes, whereas kNN depends on distances within local neighborhoods, which are more sensitive to small variations in the data.
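
The stability comparison read off the boxplot can be made quantitative by reporting the mean, standard deviation, and interquartile range of the fold accuracies. The sketch below assumes a hypothetical helper `summarize_scores`; the two lists are made-up placeholder values, not the results of this notebook, and would be replaced by `knn_accuracies` and `nb_accuracies`.

```python
import numpy as np

def summarize_scores(scores):
    """Return mean, sample standard deviation, and IQR of fold accuracies."""
    scores = np.asarray(scores, dtype=float)
    q1, q3 = np.percentile(scores, [25, 75])
    return {
        "mean": scores.mean(),
        "std": scores.std(ddof=1),  # sample standard deviation across folds
        "iqr": q3 - q1,             # height of the box in the boxplot
    }

# Placeholder values for illustration only (NOT the notebook's results).
example_a = [0.60, 0.70, 0.65, 0.75, 0.55]
example_b = [0.80, 0.82, 0.81, 0.83, 0.79]

summary_a = summarize_scores(example_a)
summary_b = summarize_scores(example_b)
print(summary_a)
print(summary_b)
```

A smaller `std` and `iqr` for one classifier supports the claim that it is the more stable of the two.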


### Exercise 1b)

In [None]:
# Define 5-fold Stratified Cross-Validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Scale the data using Min-Max Scaler
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)  # Assuming X is your feature matrix

# Initialize the models
knn = KNeighborsClassifier(n_neighbors=5)
nb = GaussianNB()

# Cross-validate and store accuracies
knn_scaled_scores = cross_val_score(knn, X_scaled, y, cv=cv)
nb_scaled_scores = cross_val_score(nb, X_scaled, y, cv=cv)

# Create boxplot for accuracies across folds (Scaled Models)
data = [knn_scaled_scores, nb_scaled_scores]
labels = ['kNN (Scaled)', 'Naive Bayes (Scaled)']

plt.boxplot(data, tick_labels=labels)
plt.ylabel('Accuracy')
plt.title('Accuracy Boxplot (Scaled Models)')
plt.show()

# Print mean accuracies
mean_accuracies = [np.mean(knn_scaled_scores), np.mean(nb_scaled_scores)]
print(f'Mean accuracy of KNN (k=5) after Min-Max scaling: {mean_accuracies[0]:.4f}')
print(f'Mean accuracy of Gaussian Naive Bayes after Min-Max scaling: {mean_accuracies[1]:.4f}')
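
One caveat with the cell above: `MinMaxScaler` is fitted on all of `X` before cross-validation, so each held-out fold influences the scaling applied to its own training data. A common alternative, sketched here on synthetic data from `make_classification` (an assumption, standing in for the heart-disease features), is to wrap the scaler and the classifier in a pipeline so that the scaler is refitted on each training fold. Scores obtained this way may differ slightly from those reported in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the real feature matrix and target (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# The scaler is fitted inside each training fold, never on the held-out fold.
knn_pipeline = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=5))
pipeline_scores = cross_val_score(
    knn_pipeline, X_demo, y_demo, cv=cv_demo, scoring="accuracy"
)

print(f"Fold accuracies: {pipeline_scores}")
print(f"Mean accuracy: {pipeline_scores.mean():.4f}")
```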


K-Nearest Neighbors (kNN):

The mean accuracy of 0.8217 is a clear improvement over the accuracy obtained on the unscaled data. kNN's predictions depend on distances between points in the feature space, and without scaling those distances are dominated by the features with the largest numeric ranges. Rescaling every feature to the same range lets each feature contribute comparably to the distance, which explains the gain.
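
To see why feature ranges matter for kNN, consider the Euclidean distance between two patients before and after Min-Max scaling. This is a minimal sketch: the patient values and the feature ranges are invented for illustration and are not taken from the dataset.

```python
import math

# Two hypothetical patients described by (chol, oldpeak); values are illustrative.
patient_a = (200.0, 0.5)
patient_b = (260.0, 3.5)

raw_distance = math.dist(patient_a, patient_b)

def min_max(value, low, high):
    """Rescale value to [0, 1] given the feature's assumed min and max."""
    return (value - low) / (high - low)

# Assumed feature ranges, chosen only for this example.
chol_range = (100.0, 600.0)
oldpeak_range = (0.0, 6.0)

scaled_a = (min_max(patient_a[0], *chol_range), min_max(patient_a[1], *oldpeak_range))
scaled_b = (min_max(patient_b[0], *chol_range), min_max(patient_b[1], *oldpeak_range))
scaled_distance = math.dist(scaled_a, scaled_b)

# Share of the squared distance contributed by cholesterol, before and after scaling.
raw_chol_share = (patient_a[0] - patient_b[0]) ** 2 / raw_distance ** 2
scaled_chol_share = (scaled_a[0] - scaled_b[0]) ** 2 / scaled_distance ** 2

print(f"Raw distance: {raw_distance:.3f}, cholesterol share: {raw_chol_share:.3f}")
print(f"Scaled distance: {scaled_distance:.3f}, cholesterol share: {scaled_chol_share:.3f}")
```

Before scaling, almost all of the distance comes from cholesterol simply because its values are numerically larger; after scaling, the large relative difference in `oldpeak` is no longer drowned out.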

Gaussian Naive Bayes (GNB):

The mean accuracy of 0.8350 shows that GNB still performs slightly better than kNN in this case. Unlike kNN, GNB gains little from Min-Max scaling: it models each feature with a per-class Gaussian, and a linear rescaling of a feature rescales the fitted mean and variance accordingly, leaving the class posteriors essentially unchanged. Its performance is therefore largely insensitive to scaling.
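
The insensitivity of GNB to scaling can be checked empirically. The sketch below uses synthetic two-class data with features on very different scales (an assumption, not the heart-disease data) and compares the predictions of `GaussianNB` fitted on raw and on Min-Max-scaled features. Tiny differences are possible in principle because of the `var_smoothing` term, so the check reports an agreement rate rather than demanding exact equality.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Synthetic two-class data with features on very different scales (assumption).
n = 200
y_demo = np.repeat([0, 1], n // 2)
feature_large = rng.normal(loc=200 + 30 * y_demo, scale=40, size=n)    # wide numeric range
feature_small = rng.normal(loc=1.0 + 1.0 * y_demo, scale=0.8, size=n)  # narrow numeric range
X_demo = np.column_stack([feature_large, feature_small])

X_demo_scaled = MinMaxScaler().fit_transform(X_demo)

pred_raw = GaussianNB().fit(X_demo, y_demo).predict(X_demo)
pred_scaled = GaussianNB().fit(X_demo_scaled, y_demo).predict(X_demo_scaled)

# Fraction of samples on which the two models agree.
agreement = np.mean(pred_raw == pred_scaled)
print(f"Prediction agreement with and without scaling: {agreement:.3f}")
```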

### Exercise 1c)

In [None]:
# One-sided paired t-test on the per-fold accuracies.
# H0: mean(kNN - NB) <= 0; H1: kNN is more accurate than NB.
t_stat, p_value = stats.ttest_rel(knn_accuracies, nb_accuracies, alternative='greater')

# Report the result
print(f"T-statistic: {t_stat}, P-value: {p_value}")

# Evaluate the hypothesis
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: KNN is statistically superior to Naive Bayes.")
else:
    print("Fail to reject the null hypothesis: no evidence that KNN is superior to Naive Bayes.")
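
For reference, the one-sided paired t-test can be reproduced by hand from the per-fold differences: the statistic is the mean difference divided by its standard error, and the p-value is the upper tail of a t distribution with n - 1 degrees of freedom. The sketch below uses placeholder accuracies (not the notebook's results) and compares the manual computation with `stats.ttest_rel`.

```python
import numpy as np
from scipy import stats

# Placeholder fold accuracies for illustration only (NOT the notebook's results).
scores_a = np.array([0.70, 0.66, 0.72, 0.68, 0.74])
scores_b = np.array([0.80, 0.79, 0.85, 0.82, 0.84])

# A paired test works on the per-fold differences.
diff = scores_a - scores_b
n_folds = len(diff)
t_manual = diff.mean() / (diff.std(ddof=1) / np.sqrt(n_folds))

# One-sided p-value for H1: mean(scores_a - scores_b) > 0.
p_manual = stats.t.sf(t_manual, df=n_folds - 1)

t_scipy, p_scipy = stats.ttest_rel(scores_a, scores_b, alternative="greater")
print(f"manual: t={t_manual:.4f}, p={p_manual:.4f}")
print(f"scipy:  t={t_scipy:.4f}, p={p_scipy:.4f}")
```

When the first classifier is worse on every fold, as in these placeholder values, the statistic is negative and the one-sided p-value is close to 1, so the null hypothesis cannot be rejected.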


### Exercise 2a)

In [None]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# K values to test
k_values = [1, 5, 10, 20, 30]

# Lists to store accuracies
train_accuracies_uniform = []
test_accuracies_uniform = []
train_accuracies_distance = []
test_accuracies_distance = []

# Loop over each k-value
for k in k_values:
    # Uniform weights
    knn_uniform = KNeighborsClassifier(n_neighbors=k, weights='uniform')
    knn_uniform.fit(X_train, y_train)
    
    train_acc_uniform = accuracy_score(y_train, knn_uniform.predict(X_train))
    test_acc_uniform = accuracy_score(y_test, knn_uniform.predict(X_test))
    
    train_accuracies_uniform.append(train_acc_uniform)
    test_accuracies_uniform.append(test_acc_uniform)
    
    # Distance weights
    knn_distance = KNeighborsClassifier(n_neighbors=k, weights='distance')
    knn_distance.fit(X_train, y_train)
    
    train_acc_distance = accuracy_score(y_train, knn_distance.predict(X_train))
    test_acc_distance = accuracy_score(y_test, knn_distance.predict(X_test))
    
    train_accuracies_distance.append(train_acc_distance)
    test_accuracies_distance.append(test_acc_distance)

# Plotting the results
plt.figure(figsize=(10, 6))

# Uniform weights plot
plt.plot(k_values, train_accuracies_uniform, label='Train Accuracy (Uniform)', marker='o')
plt.plot(k_values, test_accuracies_uniform, label='Test Accuracy (Uniform)', marker='o')

# Distance weights plot
plt.plot(k_values, train_accuracies_distance, label='Train Accuracy (Distance)', marker='x')
plt.plot(k_values, test_accuracies_distance, label='Test Accuracy (Distance)', marker='x')

plt.title('Train and Test Accuracies for Different k-Values')
plt.xlabel('Number of Neighbors (k)')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(True)
plt.show()

### Exercise 2b)

The number of neighbors (k) in a kNN classifier has a profound impact on its generalization ability. As k increases, the model shifts from being sensitive to individual training samples to a more generalized perspective that considers broader patterns in the data.

When k is low, the model effectively memorizes the training data, giving high training accuracy but poorer performance on unseen data (overfitting). As k increases, predictions are averaged over more neighbors, which dampens the effect of noise and tends to improve test accuracy. Moderate values of k therefore tend to reflect the true underlying distribution of the data better.

Beyond a certain k, the model becomes too generalized and loses the ability to discern important local patterns. This underfitting occurs because voting over too many neighbors dilutes the influence of the most relevant data points, resulting in decreased accuracy.
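
The memorization effect at low k can be illustrated on synthetic data. This is a sketch under stated assumptions: `make_classification` with label noise stands in for the real dataset, and the accuracies it prints are not those of this notebook. With k = 1 each training point is its own nearest neighbor, so training accuracy is perfect; with `weights='distance'` a training point at zero distance from itself dominates the vote, so training accuracy stays perfect for every k (absent duplicate points with conflicting labels).

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, noisy stand-in data (assumption); flip_y injects label noise.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=4, flip_y=0.1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

results = {}
for k in [1, 5, 10, 20, 30]:
    for weights in ["uniform", "distance"]:
        model = KNeighborsClassifier(n_neighbors=k, weights=weights).fit(X_tr, y_tr)
        train_acc = accuracy_score(y_tr, model.predict(X_tr))
        test_acc = accuracy_score(y_te, model.predict(X_te))
        results[(k, weights)] = (train_acc, test_acc)
        print(f"k={k:>2} weights={weights:<8} train={train_acc:.3f} test={test_acc:.3f}")
```

For this reason, the training curve for distance weighting is uninformative about overfitting, and the test curve is the one to compare across values of k.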

### Exercise 3

Here are two possible difficulties for the Naïve Bayes model when learning from this dataset:

1. Handling Continuous Features: Naïve Bayes needs a distributional model for every feature, and Gaussian Naïve Bayes assumes that each continuous feature is normally distributed within each class. Several features in this dataset are continuous (e.g., age, trestbps, chol, thalach, oldpeak). If their class-conditional distributions are skewed or multimodal, the Gaussian assumption does not hold, which could reduce the model's accuracy.

2. Correlated Features: Naïve Bayes assumes that all features are conditionally independent given the class, but in medical datasets like this one many features are likely to be correlated. For instance, blood pressure (trestbps) and cholesterol (chol) might be related. Such correlations violate the independence assumption: the model effectively counts the same evidence more than once, leading to suboptimal performance.
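
Both difficulties can be probed with simple diagnostics. The sketch below runs on a synthetic DataFrame whose column names mimic the dataset but whose values are generated for illustration (an assumption); on the real data the same calls would be applied to `df`, ideally within each class, since the independence assumption is conditional on the target.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
n = 300

# Synthetic stand-in data (assumption): column names mimic the dataset, values do not.
age = rng.normal(55, 9, size=n)
demo = pd.DataFrame({
    "age": age,
    "trestbps": 90 + 0.8 * age + rng.normal(0, 5, size=n),  # built to correlate with age
    "oldpeak": rng.exponential(1.0, size=n),                 # built to be right-skewed
})

# 1) Pairwise Pearson correlations: large off-diagonal values hint at dependence.
corr = demo.corr()
print(corr.round(2))

# 2) Skewness and Shapiro-Wilk test: a small p-value suggests a non-Gaussian feature.
skewness = demo.skew()
shapiro_p = {col: stats.shapiro(demo[col].to_numpy()).pvalue for col in demo.columns}
print(skewness.round(2))
print(shapiro_p)
```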