# **Analysis of Supervised and Unsupervised Learning on UCI Dataset (ID: 267)**  

## **1. Data Pretreatment**  


In [None]:
import os  
import numpy as np  
import pandas as pd 
import seaborn as sns  
import matplotlib.pyplot as plt  

from numpy import linalg as LA  
from tqdm import tqdm  
from scipy.spatial.distance import cdist, euclidean, minkowski  

from sklearn import preprocessing  
from sklearn.decomposition import PCA  
from sklearn.manifold import TSNE  
from sklearn.cluster import KMeans, DBSCAN  
from sklearn.neighbors import NearestNeighbors  
from sklearn.linear_model import LinearRegression, LogisticRegression  
from sklearn.model_selection import KFold, train_test_split  
from sklearn.metrics import (
    accuracy_score, classification_report, confusion_matrix, 
    ConfusionMatrixDisplay, normalized_mutual_info_score,
)



### **1.1 Data Loading and Inspection**  
The dataset was obtained from the UCI Machine Learning Repository using `fetch_ucirepo(id=267)`. After loading the data, we examined its structure, including the number of samples, features, and class distribution. From the repository we have:
1. The data has no missing values.
2. The target (class label) is already encoded as 0 or 1.
3. There are 1372 samples and 4 features.
4. All four features are continuous.

data: [1372 rows x 5 columns]
| variance | skewness | curtosis | entropy  | targets |
|----------|----------|----------|----------|---------|
| 3.62160  | 8.66610  | -2.8073  | -0.44699 | 0       |
| 4.54590  | 8.16740  | -2.4586  | -1.46210 | 0       |
| 3.86600  | -2.63830 | 1.9242   | 0.10645  | 0       |
| 3.45660  | 9.52280  | -4.0112  | -3.59440 | 0       |
| 0.32924  | -4.45520 | 4.5718   | -0.98880 | 0       |
| ...      | ...      | ...      | ...      | ...     |
| 0.40614  | 1.34920  | -1.4501  | -0.55949 | 1       |
| -1.38870 | -4.87730 | 6.4774   | 0.34179  | 1       |
| -3.75030 | -13.45860| 17.5932  | -2.77710 | 1       |
| -3.56370 | -8.38270 | 12.3930  | -1.28230 | 1       |
| -2.54190 | -0.65804 | 2.6842   | 1.19520  | 1       |

In [163]:
try:
    from ucimlrepo import fetch_ucirepo
    # fetch dataset 
    banknote_authentication = fetch_ucirepo(id=267)
    # construct the dataframe
    data = banknote_authentication["data"]["features"]
    data["targets"] = banknote_authentication["data"]["targets"]
    print(data)
except Exception:  # fall back to a local copy if ucimlrepo is unavailable or the fetch fails
    FFILE = './data_banknote_authentication.txt'
    if os.path.isfile(FFILE): 
        print("File already exists")
        if os.access(FFILE, os.R_OK):
            print ("File is readable")
        else:
            print ("File is not readable")
    else:
        print("Either the file is missing or not readable, download it and extract the zip")
        !wget "https://archive.ics.uci.edu/static/public/267/banknote+authentication.zip"
    
    # Load the data

    column_names = ["variance", "skewness", "curtosis", "entropy", "targets"]
    data = pd.read_csv('./data_banknote_authentication.txt', names=column_names)
    print(data)

      variance  skewness  curtosis  entropy  targets
0      3.62160   8.66610   -2.8073 -0.44699        0
1      4.54590   8.16740   -2.4586 -1.46210        0
2      3.86600  -2.63830    1.9242  0.10645        0
3      3.45660   9.52280   -4.0112 -3.59440        0
4      0.32924  -4.45520    4.5718 -0.98880        0
...        ...       ...       ...      ...      ...
1367   0.40614   1.34920   -1.4501 -0.55949        1
1368  -1.38870  -4.87730    6.4774  0.34179        1
1369  -3.75030 -13.45860   17.5932 -2.77710        1
1370  -3.56370  -8.38270   12.3930 -1.28230        1
1371  -2.54190  -0.65804    2.6842  1.19520        1

[1372 rows x 5 columns]


In [164]:
print("(sample, features):", data.shape)
print("First class count:", sum(data['targets']==0))
print("Second class count:", sum(data['targets']==1))

(sample, features): (1372, 5)
First class count: 762
Second class count: 610



### **1.2 Scaling, Normalization and Sorting Issues in the Dataset** 
The dataset consists of numerical features on different scales: across all features, the maximum value is 17.9274 and the minimum is -13.7731. To ensure proper model training and clustering, **feature scaling** was applied using standardization (z-score normalization).

The data was split into train and test sets, the scaler was fit on the training set and applied to both sets, and the two scaled sets were then reassembled for models that require the full dataset.
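
As a minimal sketch of the standardization step, assuming a small made-up feature matrix rather than the banknote data, the snippet below computes the mean and standard deviation on the training rows only and reuses them for the test rows. This mirrors what `StandardScaler.fit_transform` followed by `transform` does; the names `X_tr`, `X_te`, `mu`, and `sigma` are illustrative only.

```python
import numpy as np

# Toy data (hypothetical values, not taken from the banknote dataset)
X_tr = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_te = np.array([[2.0, 40.0]])

# Statistics are estimated on the training rows only
mu = X_tr.mean(axis=0)
sigma = X_tr.std(axis=0)  # population standard deviation (ddof=0)

# The same shift and scale are applied to both splits
X_tr_z = (X_tr - mu) / sigma
X_te_z = (X_te - mu) / sigma
```

After this step the training columns have mean 0 and variance 1, while the test columns generally do not, because they are scaled with the training statistics.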

full dataset randomized and scaled: [1372 rows x 5 columns]
| variance | skewness | curtosis | entropy  | targets |
|----------|----------|----------|----------|---------|
| 0.904618 | 1.601126 | -1.265374 | -1.495569 | 0.0     |
| 1.532814 | -0.691013 | -0.000450 | 0.973356 | 0.0     |
| -0.367168 | -1.662094 | 1.257462 | 0.697353 | 1.0     |
| -2.299623 | 1.344148 | -0.419396 | -2.767430 | 1.0     |
| -0.539056 | -0.520896 | 0.148416 | 0.520688 | 1.0     |
| ...      | ...      | ...      | ...      | ...     |
| 0.706408 | 0.908746 | -0.465262 | 0.769656 | 0.0     |
| 1.130878 | 0.958700 | -0.751494 | 0.639514 | 0.0     |
| -1.804741 | 0.344855 | -0.217882 | -0.196042 | 1.0     |
| -0.369069 | -0.631649 | -0.471420 | 0.563440 | 1.0     |
| 1.394944 | -1.047881 | 0.753370 | 1.083987 | 0.0     |



In [None]:
np.random.seed(42)
X = data.iloc[:,:-1].values
y_scaled = data.iloc[:,4].values
N = X.shape[0]
nc = X.shape[1]

In [None]:
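# Split the dataset into training and test sets.
# test_size is the fraction assigned to the test set, so about 73% of the
# samples (1001 of 1372) go to the test split and 371 are left for training.
X_train, X_test, train_y, test_y = train_test_split(X, y_scaled, test_size=0.728863, random_state=42)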

In [None]:
# Standardize the features: the scaler is fit on the training data (mean 0, variance 1 there) and then applied to the test data
scaler = preprocessing.StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # Scale training data
X_test_scaled = scaler.transform(X_test)

In [None]:
# Construct the dataframes.
# With test_size=0.728863, 1001 samples form the test set and the
# remaining 371 samples are used for training the models;
# both splits are used in supervised learning.
combined_train = np.hstack((X_train_scaled, train_y.reshape(-1, 1)))
data_scaled_train = pd.DataFrame(combined_train, columns=data.columns.tolist())

combined_test = np.hstack((X_test_scaled, test_y.reshape(-1, 1)))
data_scaled_test = pd.DataFrame(combined_test, columns=data.columns.tolist())

full_data_scaled = pd.concat([data_scaled_train, data_scaled_test])

In [None]:
# Features and labels after shuffling and scaling
X_scaled = np.array(full_data_scaled.iloc[:, :-1])
y_scaled = np.array(full_data_scaled.iloc[:, -1])

In [None]:
# Flatten the label arrays to 1D
train_y = train_y.ravel()
test_y = test_y.ravel()

----


## **2. Unsupervised Learning**  

### **2.1 PCA for Visualization**  
**Principal Component Analysis (PCA)** was applied to reduce the dataset to two dimensions for visualization. The first two principal components were plotted, with points colored by their actual class labels.

**Observations (from plot):**  
- The classes are **not linearly separable** in this reduced space.  
- Some overlap between clusters suggests that linear models might struggle with classification.  
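
To make the projection step concrete, here is a minimal NumPy sketch of PCA via an eigendecomposition of the covariance matrix. It uses randomly generated data as a stand-in for the scaled features (an assumption for illustration), and the names `Z`, `explained_ratio`, and `proj_2d` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical correlated 4-feature data standing in for the scaled features
Z = rng.normal(size=(200, 4))
Z[:, 1] += 0.8 * Z[:, 0]
Zc = Z - Z.mean(axis=0)  # center each column

# Eigendecomposition of the sample covariance matrix
cov = np.cov(Zc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]       # reorder to descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Fraction of total variance carried by each component, and its running sum
explained_ratio = eigvals / eigvals.sum()
cumulative = np.cumsum(explained_ratio)

# Project onto the two leading principal directions
proj_2d = Zc @ eigvecs[:, :2]
```

The cumulative ratio after two components indicates how much of the total variance a 2D plot can show; the remainder is discarded by the projection.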



In [None]:
# Performing Principal Component Analysis (PCA) using sklearn
pca = PCA(n_components=2)

# Fitting the PCA model to the scaled data
pca.fit(X_scaled)

# Transforming the original data to the principal components
projection = pca.transform(X_scaled)

# Calculating the cumulative explained variance ratio of the retained components
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)

# Extracting eigenvectors, eigenvalues and component indices
eigenvector = pca.components_
eigenvalues = pca.explained_variance_
components = np.arange(pca.n_components_) + 1

In [None]:
fig, ax = plt.subplots(figsize=(7, 7))
ax.scatter(projection[:, 0], projection[:, 1], c=y_scaled)
ax.set_title('2D PCA Visualization')
plt.show()

### **2.2 K-Means Clustering**  
**K-Means clustering** was applied with `k=2` (assuming two clusters).  

**Results:**  
- When using **only the first two PCA components**, the k-means clusters **matched the true classes poorly**, showing that a 2D projection may not contain enough information.  
- When using **all features**, agreement improved but many points were still assigned to the wrong class.   

Comparing the k-means labels on the two PCA components with the true classes gives the following classification report:
| Metric        | 0.0  | 1.0  | Macro Avg | Weighted Avg | Accuracy |
|--------------|------|------|-----------|--------------|----------|
| Precision    | 0.50 | 0.38 | 0.44      | 0.45         |   -   |
| Recall       | 0.46 | 0.42 | 0.44      | 0.44         |    -      |
| F1-Score     | 0.48 | 0.40 | 0.44      | 0.44         |    -      |
| Support      | 762  | 610  | 1372      | 1372         |   -       |
| Accuracy      | -  | -  | -      | -         |   0.44       |

From this table:
1. Precision:
    - For class 0.0, the model's precision is 0.50, meaning that when it predicts class 0, it is correct 50% of the time.
    - For class 1.0, the precision is 0.38, so when it predicts class 1, it is correct 38% of the time.

2. Recall:
    - For class 0.0, recall is 0.46, meaning the model correctly identifies 46% of actual class 0 instances.
    - For class 1.0, recall is 0.42, meaning it correctly identifies 42% of actual class 1 instances.

3. F1-Score:
    - For class 0.0, the F1-score is 0.48, indicating a balance between precision and recall.
    - For class 1.0, the F1-score is 0.40, showing lower performance in predicting this class.

The model correctly classifies 44% of the total samples.

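For comparison, the same report for k-means on the full set of features:
| Metric        | 0.0  | 1.0  | Macro Avg | Weighted Avg | Accuracy |
|--------------|------|------|-----------|--------------|----------|
| Precision    | 0.61 | 0.50 | 0.56      | 0.56         |    -     |
| Recall       | 0.55 | 0.57 | 0.56      | 0.56         |    -     |
| F1-Score     | 0.58 | 0.53 | 0.56      | 0.56         |    -     |
| Support      | 762  | 610  | 1372      | 1372         |    -     |
| Accuracy     | -    | -    | -         | -            |   0.56   |

All metrics are higher on the full feature set, which is consistent with some information being lost in the 2D PCA projection. This comparison needs care, however: k-means cluster indices are arbitrary, so the scores depend on how the indices happen to line up with the class labels. An accuracy of 0.44 becomes 0.56 if the two cluster indices are swapped, so the gap between the two tables may largely reflect label assignment rather than clustering quality.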
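
Because cluster indices carry no meaning, one way to compare a two-cluster result with the true classes is to score both possible index assignments and keep the better one. The sketch below does this for accuracy on made-up labels; `best_label_accuracy` is a hypothetical helper, not a function used elsewhere in this notebook.

```python
import numpy as np

def best_label_accuracy(y_true, cluster_labels):
    """Accuracy of a 2-cluster labeling under the better of the two index assignments."""
    y_true = np.asarray(y_true).astype(int)
    cluster_labels = np.asarray(cluster_labels).astype(int)
    acc = np.mean(y_true == cluster_labels)              # indices used as given
    acc_swapped = np.mean(y_true == 1 - cluster_labels)  # indices 0 and 1 exchanged
    return float(max(acc, acc_swapped))

# Toy example: the clustering is mostly right, but its indices are inverted
y_demo = np.array([0, 0, 0, 1, 1, 1])
z_demo = np.array([1, 1, 0, 0, 0, 0])

raw_acc = float(np.mean(y_demo == z_demo))
aligned_acc = best_label_accuracy(y_demo, z_demo)
```

Here the raw accuracy is 1/6 while the aligned accuracy is 5/6, even though the clustering itself is the same. Permutation-invariant scores such as the normalized mutual information avoid this issue altogether.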

In [None]:
# Compute the pairwise Euclidean distance matrix between unique rows of the scaled dataset
# (duplicates are dropped so that the nearest-neighbor distance is never zero)
unique_scaled_data = np.unique(X_scaled, axis=0)
distance_matrix = cdist(unique_scaled_data, unique_scaled_data)
distance_matrix.sort(axis=1)  # after sorting, column 0 is each point's zero distance to itself

# Ratio of the second-nearest-neighbor distance to the nearest-neighbor distance
mu_i = np.divide(distance_matrix[:, 2], distance_matrix[:, 1])

# Logarithm of the ratios
log_mu_i = np.log(mu_i)

# Inverse of the mean log-ratio; this has the form of the Two-NN estimate of intrinsic dimension
two_nn = 1 / np.mean(log_mu_i)

# Print the resulting value
print("Two-NN intrinsic dimension estimate:", two_nn)

In [None]:
def k_means_internal(k, X, init):
    '''
    Perform k-means clustering.

    Parameters
    ----------
    k : int
        Number of clusters.
    X : matrix of dimension N x D
        Dataset.
    init : str, {'++', 'random'}
        Type of initialization for k-means algorithm.

    Returns
    -------
    tuple
        z_new : array
            Cluster assignments for each data point.
        L : float
            Final value of the k-means objective function (loss).
        niter : int
            Number of iterations performed.
    '''
    N = X.shape[0]  # number of points
    nc = X.shape[1]  # number of coordinates
    ll = np.arange(k)
    z = np.zeros(N, dtype='int')  # cluster number assigned to each data point
    cent = np.zeros([k, nc])  # coordinates of the cluster centers

    # k-means++
    if init == '++':
        b = np.random.choice(N, 1, replace=False)  # choose the first cluster center at random
        cent[0, :] = X[b, :]
        nchosen = 1  # number of cluster centers already set

        while nchosen < k:
            dist = cdist(cent[:nchosen, :], X)  # distance of each point from the cluster centers
            dmin = np.min(dist, axis=0)  # min distance between point and cluster centers
            prob = dmin**2
            prob = prob / np.sum(prob)

            # choose next center according to the computed prob
            b = np.random.choice(N, 1, replace=False, p=prob)
            cent[nchosen, :] = X[b, :]
            nchosen += 1

    # random initialization
    else:
        b = np.random.choice(N, k, replace=False)  # choose the k centers randomly
        for i in ll:
            cent[i, :] = X[b[i], :]

    dist = cdist(cent, X)  # distance of each point from cluster centers
    z_new = np.argmin(dist, axis=0)  # assign each point to cluster with the closest center
    dmin = np.min(dist, axis=0)
    niter = 0
    L = np.sum(dmin**2)  # loss function evaluation

    while (z_new != z).any():  # until a stable configuration is reached
        z = np.copy(z_new)

        for i in range(k):
            cent[i, :] = np.average(X[z == i, :], axis=0)  # compute cluster centroids

        dist = cdist(cent, X)  # update distances from cluster centers
        z_new = np.argmin(dist, axis=0)  # find cluster with the minimum centroid distance
        dmin = np.min(dist, axis=0)
        L = np.sum(dmin**2)  # loss function evaluation
        niter += 1

    return z_new, L, niter

In [None]:
def k_means(k, X, init='++', n_init=20):
    '''
    Perform k-means clustering with multiple initializations to find the best result.

    Parameters
    ----------
    k : int
        Number of clusters.
    X : matrix of dimension N x D
        Dataset.
    init : str, {'++', 'random'}, optional
        Type of initialization for k-means algorithm.
    n_init : int, optional
        Number of runs of the algorithm with different initializations.

    Returns
    -------
    tuple
        labels_opt : array
            Cluster assignments for each data point in the best-performing iteration.
        lmin : float
            Loss (objective function) for the best-performing iteration.
    '''
    lmin = float('inf')  # Initialize with a large value
    labels_opt = None

    for i in range(n_init):
        # Run k-means for each initialization
        labels, loss, niter = k_means_internal(k, X, init=init)

        # Check if the current iteration has a lower loss
        if loss < lmin:
            lmin = loss
            labels_opt = labels

    return labels_opt, lmin

In [None]:
kmeans_labels, l_kmeans = k_means(2, projection, init='++', n_init=20)

In [None]:
fig = plt.figure(figsize=(9, 9))
ax = fig.add_subplot()
ax.scatter(projection[:,0], projection[:,1], c=kmeans_labels)
plt.show()
print("clustering comparison: ", normalized_mutual_info_score(kmeans_labels, y_scaled.flatten()))

In [None]:
print("k-means + pca with n=2: \n", classification_report(y_scaled, kmeans_labels))

In [None]:
kmeans_labels, l_kmeans = k_means(2, np.array(full_data_scaled.iloc[:, :-1]), init='++', n_init=20)

In [None]:
fig = plt.figure(figsize=(9, 9))
ax = fig.add_subplot()
ax.scatter(np.array(full_data_scaled.iloc[:, 0]), np.array(full_data_scaled.iloc[:, 1]), c=kmeans_labels)
plt.show()
print("clustering comparison: ", normalized_mutual_info_score(kmeans_labels, y_scaled.flatten()))

In [None]:
print("k-means with n=2: \n", classification_report(y_scaled, kmeans_labels))

### **2.3 t-SNE for Nonlinear Projection**  
We used **t-SNE** for dimensionality reduction and visualized the data in 2D.  

**Observations:**  
- t-SNE provided a **better separation** than PCA, suggesting some non-linear class structure.  
- The class distributions are still somewhat mixed, indicating potential challenges for clustering algorithms.  


In [None]:
# Embed the data into 2D using t-SNE
X_embedded = TSNE(n_components=2, learning_rate='auto', init='random', perplexity=15, random_state=42).fit_transform(X_scaled)

# Create a scatter plot of the embedded data, colored by ground truth labels
fig, ax = plt.subplots(figsize=(7, 7))
ax.scatter(X_embedded[:, 0], X_embedded[:, 1], c=y_scaled)

# Set plot title
ax.set_title('t-SNE Visualization')

# Display the plot
plt.show()

### **2.4 DBSCAN Clustering**  
We applied **DBSCAN**, a density-based clustering algorithm.  

**Results:**   
- It identified core clusters but also **classified some points as noise**.  
- The results depended significantly on the hyperparameters `eps` and `min_samples`; `eps` was chosen from a nearest-neighbor distance plot computed with `n_neighbors = 50`.
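
One common heuristic for choosing `eps` is to sort the distance from each point to its k-th nearest neighbor and look for the "elbow" of the resulting curve. The sketch below illustrates that heuristic on random points with a small, hypothetical `k`; it is an assumption about the general technique and may differ in detail from the neighbor-distance cell used in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
pts = rng.normal(size=(100, 2))  # hypothetical points, not the banknote data
k = 5                            # hypothetical neighbor count

# Pairwise Euclidean distances, sorted along each row;
# column 0 is the zero distance from each point to itself
diff = pts[:, None, :] - pts[None, :, :]
dists = np.sqrt((diff ** 2).sum(axis=-1))
dists.sort(axis=1)

# Distance to the k-th nearest neighbor of every point, in ascending order
k_dist = np.sort(dists[:, k])
```

Plotting `k_dist` against its index gives the k-distance curve; a candidate `eps` is the distance at which the curve starts to rise sharply.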

### DBSCAN Metrics with noise removed

| Cluster | Precision | Recall | F1-Score | Support | **Accuracy** |
|---------|-----------|--------|----------|---------|---------|
| **0.0** | 1.00     | 0.83   | 0.91     | 521     | -  |
| **1.0** | 0.96     | 0.80   | 0.88     | 369     |-     |
| **2.0** | 0.00     | 0.00   | 0.00     | 0       |-     |
| **3.0** | 0.00     | 0.00   | 0.00     | 0       |-     |
| **Accuracy**  | -  | -  | - | - |**0.82**     |
| **Macro Avg** | 0.49 | 0.41 | 0.45 | 890 |-     |
| **Weighted Avg** | 0.99 | 0.82 | 0.90 | 890 |-     |

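From this table:
The report lists four cluster labels (0.0, 1.0, 2.0, and 3.0), but labels 2.0 and 3.0 have zero support. Support counts the points whose true class equals the label, and the true classes are only 0 and 1; these rows appear because DBSCAN assigned some points to clusters 2 and 3.
The majority of the data points are assigned to clusters 0.0 and 1.0.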

1. Precision:
    - Cluster 0.0 has a precision of 1.00, meaning all points assigned to this cluster were correctly grouped (no false positives).
    - Cluster 1.0 has a precision of 0.96, indicating that most points were correctly assigned, but a few may have been misclassified.

2. Recall:
    - Cluster 0.0 has a recall of 0.83, meaning 83% of the actual members of this cluster were successfully identified.
    - Cluster 1.0 has a recall of 0.80, meaning 80% of the actual points belonging to this cluster were captured.
    - Recall is below 1 because some points of each class were placed in a different cluster, including the extra clusters 2.0 and 3.0; noise points are already excluded from this table.

3. F1-Score:
    - Cluster 0.0: 0.91 (high, meaning both precision and recall are strong).
    - Cluster 1.0: 0.88 (also high, but slightly lower than cluster 0.0).

With an accuracy of 0.82, 82% of the 890 non-noise points carry a cluster label equal to their true class.

- Macro Average: The unweighted mean of precision, recall, and F1-score across all four labels. Since labels 2.0 and 3.0 have zero support and scores of 0.00, they pull the macro average down.
- Noise points were removed before computing these metrics, so the scores describe only the 890 clustered points out of 1372 and are higher than they would be if noise points counted as errors.
- Weighted Average: Averages the scores while weighting each label by its support. The weighted values are high because the meaningful clusters (0.0 and 1.0) have strong performance.
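
A minimal sketch of how noise-removed metrics of this kind can be computed, using made-up labels rather than the actual DBSCAN output. It assumes the usual DBSCAN convention that noise is labeled `-1`; `recall_per_label` is a hypothetical helper written for this illustration.

```python
import numpy as np

# Toy labels: true classes and DBSCAN-style cluster labels (-1 marks noise)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 0, 1])
db_labels = np.array([0, 0, 0, 2, 1, 1, 1, 3, -1, -1])

# Drop noise points before scoring
keep = db_labels != -1
y_kept, z_kept = y_true[keep], db_labels[keep]

def recall_per_label(y, z, labels):
    """Recall for each label; labels that never occur in y get 0.0."""
    out = {}
    for lab in labels:
        support = np.sum(y == lab)
        hits = np.sum((y == lab) & (z == lab))
        out[int(lab)] = float(hits / support) if support else 0.0
    return out

# Labels 2 and 3 exist only as cluster indices, so their support is zero
labels = np.union1d(y_kept, z_kept)
recalls = recall_per_label(y_kept, z_kept, labels)
macro_recall = float(np.mean(list(recalls.values())))
accuracy = float(np.mean(y_kept == z_kept))
noise_fraction = float(1 - keep.mean())
```

In this toy case the per-class recall is 0.75 for both real classes, but the macro recall drops to 0.375 because the two zero-support labels are averaged in. Reporting `noise_fraction` next to the accuracy makes clear how many points the scores leave out.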

In [None]:
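# Find the 50 nearest neighbors of each data point (the point itself comes first, at distance 0).
# Each column of distances is then sorted independently and column 1, the distance to the
# closest other point, is kept; epsilon is inferred from the resulting curve.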
neighbors = NearestNeighbors(n_neighbors=50)
neighbors_fit = neighbors.fit(X_scaled)
distances, indices = neighbors_fit.kneighbors(X_scaled)
distances = np.sort(distances, axis=0)
distances = distances[:,1]

In [None]:
fig, axs = plt.subplots(1, 2, figsize=(12, 5))

# First plot: Full range
axs[0].plot(distances)
axs[0].set_title("Full Range")
axs[0].set_xlabel("Index")
axs[0].set_ylabel("Distances")

# Second plot: Limited x-axis
axs[1].plot(distances)
axs[1].set_xlim(1360, 1380)
axs[1].set_title("Limited X-Range [1360, 1380]")
axs[1].set_xlabel("Index")
axs[1].set_ylabel("Distances")

# Show the plots
plt.tight_layout()
plt.show()

In [None]:
# Apply DBSCAN clustering algorithm to the scaled data
db = DBSCAN(eps=0.6, min_samples=50)
dbscan = db.fit(X_scaled)

# Visualize the clusters in a 2D scatter plot using the first two principal components
fig = plt.figure(figsize=(9, 9))
ax = fig.add_subplot()
ax.scatter(projection[:, 0], projection[:, 1], c=dbscan.labels_)

# Display the plot
plt.show()


---



## **3. Supervised Learning**  



### **3.1 Logistic Regression** 
- The default model reaches a high test accuracy of 0.9770; the confusion matrix shows incorrect predictions for only 23 of the 1001 test instances.
- The effect of **regularization** was evaluated with a grid search over the penalty type and `C`, scored on the test split. The best score for each metric (precision, recall, and F1 are macro averages) and the corresponding parameters are:

| Metric      | Best Score | Parameters                          |
|------------|-----------|------------------------------------|
| Accuracy   | 0.9900    | {'penalty': 'l1', 'C': 2.1544}   |
| Precision  | 0.9901    | {'penalty': 'l1', 'C': 10.0}     |
| Recall     | 0.9906    | {'penalty': 'l1', 'C': 2.1544}   |
| F1-Score   | 0.9898    | {'penalty': 'l1', 'C': 2.1544}   |

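- The logistic model, fit on the training split, therefore performs well on the test split. Because the parameters were selected on that same test split, these best scores are likely optimistic.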
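
The search in this section scores every setting directly on the test split. As a hedged alternative, and not the procedure that produced the table above, the sketch below selects `C` by 5-fold cross-validation with scikit-learn's `GridSearchCV`. It runs on synthetic data from `make_classification`; the names `X_cv`, `y_cv`, and `search` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the scaled training data (not the banknote dataset)
X_cv, y_cv = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

# Same C grid as in the notebook; only the regularization strength is tuned here
param_grid = {"C": np.logspace(-3, 3, 10)}

# Each candidate is scored by 5-fold cross-validation on the data passed to fit
search = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, cv=5, scoring="accuracy"
)
search.fit(X_cv, y_cv)

best_C = search.best_params_["C"]
best_cv_accuracy = search.best_score_
```

With this setup the test split is not used during selection, so a final evaluation on it gives a less biased estimate of generalization.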

In [174]:
# Initialize model
log_reg = LogisticRegression()

# Train (fit) the model
log_reg.fit(X_train_scaled, train_y.ravel())

# Predict on test data
y_pred = log_reg.predict(X_test_scaled)

# Evaluate accuracy
accuracy = accuracy_score(test_y.ravel(), y_pred)
print("Model Accuracy:", accuracy)

Model Accuracy: 0.977022977022977


In [None]:
# Confusion matrix that shows how many are predicted correctly
cm = confusion_matrix(test_y, y_pred)
ConfusionMatrixDisplay(cm).plot(cmap="Blues")
plt.title("Confusion Matrix")
plt.show()

In [None]:
# Define different regularization types
penalties = ['l1', 'l2', 'elasticnet']
C_values = np.logspace(-3, 3, 10)  
l1_ratios = np.linspace(0.1, 0.9, 5)  

# Store best scores
best_metrics = {'accuracy': 0, 'precision': 0, 'recall': 0, 'f1': 0}
best_params = {}

for pen in penalties:
    for C in C_values:
        if pen == 'elasticnet':
            for l1_ratio in l1_ratios:
                log_reg = LogisticRegression(C=C, penalty=pen, solver='saga', l1_ratio=l1_ratio, max_iter=15000)
                log_reg.fit(X_train_scaled, train_y)
                y_pred = log_reg.predict(X_test_scaled)

                # Compute metrics
                report = classification_report(test_y, y_pred, output_dict=True, zero_division=0)
                acc = report['accuracy']
                prec = report['macro avg']['precision']
                rec = report['macro avg']['recall']
                f1 = report['macro avg']['f1-score']

                # Update best metrics
                for metric, value in zip(['accuracy', 'precision', 'recall', 'f1'], [acc, prec, rec, f1]):
                    if value > best_metrics[metric]:
                        best_metrics[metric] = value
                        best_params[metric] = {'penalty': pen, 'C': C, 'l1_ratio': l1_ratio}

        else:
            log_reg = LogisticRegression(C=C, penalty=pen, solver='liblinear', max_iter=15000)
            log_reg.fit(X_train_scaled, train_y)
            y_pred = log_reg.predict(X_test_scaled)

            # Compute metrics
            report = classification_report(test_y, y_pred, output_dict=True, zero_division=0)
            acc = report['accuracy']
            prec = report['macro avg']['precision']
            rec = report['macro avg']['recall']
            f1 = report['macro avg']['f1-score']

            # Update best metrics
            for metric, value in zip(['accuracy', 'precision', 'recall', 'f1'], [acc, prec, rec, f1]):
                if value > best_metrics[metric]:
                    best_metrics[metric] = value
                    best_params[metric] = {'penalty': pen, 'C': C}

# Print the best values and parameters
for metric, value in best_metrics.items():
    print(f"Best {metric}: {value:.4f} (Params: {best_params[metric]})")


### **3.2 Decision Tree (ID3 Algorithm)**  
- Greedy algorithm for tree construction.  
- Hyperparameters (depth, minimum samples per leaf) were optimized via cross-validation. 

From the algorithm we get an accuracy of 0.9470 for the training data.

Performing cross-validation we get:
- Mean Accuracy: 0.9410
- Best Accuracy: 0.9599
- Best Tree:

    - variance <= 0.0827
        - skewness <= -0.2305 → **1.0**
        - skewness > -0.2305
            - skewness <= 0.3693 → **1.0**
            - skewness > 0.3693
                - variance <= -0.7789
                    - skewness <= 1.0834 → **1.0**
                    - skewness > 1.0834 → **0.0**
                - variance > -0.7789 → **0.0**
    - variance > 0.0827
        - variance <= 0.9059
            - curtosis <= -0.3654
                - skewness <= 0.8130 → **1.0**
                - skewness > 0.8130 → **0.0**
            - curtosis > -0.3654 → **0.0**
        - variance > 0.9059 → **0.0**
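
Each split in such a tree is chosen by comparing label entropies before and after the split. The following is a minimal, self-contained sketch of that computation on six made-up samples; `label_entropy` and `split_information_gain` are hypothetical helpers written for this illustration and are separate from the functions defined in the cell below.

```python
import numpy as np

def label_entropy(labels):
    """Shannon entropy (in bits) of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def split_information_gain(values, labels, threshold):
    """Entropy before the split minus the weighted entropy after it."""
    left = labels[values <= threshold]
    right = labels[values > threshold]
    if len(left) == 0 or len(right) == 0:
        return 0.0  # a one-sided split separates nothing
    w_left = len(left) / len(labels)
    conditional = w_left * label_entropy(left) + (1 - w_left) * label_entropy(right)
    return label_entropy(labels) - conditional

# Toy feature values and class labels (hypothetical)
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([1, 1, 1, 0, 0, 1])

gain = split_information_gain(x, y, threshold=0.0)
```

Splitting at 0.0 leaves a pure left side and a mixed right side, so the gain is half of the parent entropy, about 0.459 bits.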


In [None]:
# compute H(S)
def entropy(train_data, label, class_list):
    """
    Calculate the entropy of a dataset.

    Parameters
    ----------
    train_data : DataFrame
        The training dataset.
    label : str
        The name of the column representing the class labels.
    class_list : list of str
        List of possible values of the class labels.

    Returns
    -------
    total_entr : float
        The entropy of the dataset.
    """
    # Get the total number of instances in the dataset
    total_row = train_data.shape[0]
    # Initialize the total entropy variable
    total_entr = 0

    # Iterate through each possible class in the label
    for c in class_list:
        # Count the number of points belonging to the current class
        total_class_count = train_data[train_data[label] == c].shape[0]

        # Check if there are instances of the class to avoid numerical errors
        if total_class_count > 0:
            # Calculate the entropy of the current class
            total_class_entr = - (total_class_count / total_row) * np.log2(total_class_count / total_row)
            # Add the entropy of the current class to the total entropy of the dataset
            total_entr += total_class_entr

    # Return the calculated total entropy of the dataset
    return total_entr

# compute H(S_j)
def feature_entropy(left_data, right_data, label, class_list):
    """
    Calculate the conditional entropy of a dataset split by a specific feature.

    Parameters
    ----------
    left_data : DataFrame
        Subset of the dataset where the feature has a specific value.
    right_data : DataFrame
        Subset of the dataset where the feature has another value.
    label : str
        The name of the column representing the class labels.
    class_list : list of str
        List of possible values of the class labels.

    Returns
    -------
    ent : float
        The conditional entropy of the dataset split by the feature.
    """
    # Get the total number of points considered after the split
    row_count = left_data.shape[0] + right_data.shape[0]

    # Calculate the probabilities of the left and right subsets
    p_left = left_data.shape[0] / row_count
    p_right = right_data.shape[0] / row_count

    # Calculate the conditional entropy using the weighted average of entropies for left and right subsets
    ent = p_left * entropy(left_data, label, class_list) + p_right * entropy(right_data, label, class_list)

    # Return the calculated conditional entropy
    return ent

def split_dec_tree(feature_column, threshold):
    """
    Split the indices of data points based on a feature and a threshold.

    Parameters
    ----------
    feature_column : array-like
        The values of the feature for each data point.
    threshold : float
        The threshold value for splitting the data points.

    Returns
    -------
    left_rows : array-like
        Indices of data points where the feature value is less than or equal to the threshold.
    right_rows : array-like
        Indices of data points where the feature value is greater than the threshold.
    """
    # Find the indices of data points where the feature value is less than or equal to the threshold
    left_rows = np.argwhere(feature_column <= threshold).flatten()
    # Find the indices of data points where the feature value is greater than the threshold
    right_rows = np.argwhere(feature_column > threshold).flatten()

    # Return the indices for left and right subsets
    return left_rows, right_rows

def information_gain(data, feature_name, label, class_list, threshold):
    """
    Score a split of the dataset based on a feature and a threshold.

    Note: despite its name, this function returns the conditional entropy
    after the split, not the gain itself. The parent entropy is the same for
    every candidate split, so minimizing this value is equivalent to
    maximizing the information gain.

    Parameters
    ----------
    data : DataFrame
        The dataset.
    feature_name : str
        The name of the feature on which the dataset is split.
    label : str
        The name of the column representing the class labels.
    class_list : list
        List of possible values of the class labels.
    threshold : float
        The threshold value for splitting the dataset.

    Returns
    -------
    feat_entropy : float
        The conditional entropy of the labels after the split (lower is better).
    """
    # Split the dataset into left and right subsets based on the feature and threshold
    left_rows, right_rows = split_dec_tree(data[feature_name].values, threshold)

    # A split that leaves one side empty separates nothing: its conditional entropy
    # equals the entropy of the unsplit data, so it can never score better than a real split
    if len(left_rows) == 0 or len(right_rows) == 0:
        return entropy(data, label, class_list)

    # Calculate the conditional entropy of the split dataset
    feat_entropy = feature_entropy(data.iloc[left_rows], data.iloc[right_rows], label, class_list)

    return feat_entropy

def get_split_thresholds(feature_column, n_thresholds):
    """
    Generate candidate split thresholds for a given feature column.

    Parameters
    ----------
    feature_column : array-like
        The values of the feature for each data point.
    n_thresholds : int
        The number of thresholds to generate.

    Returns
    -------
    thresholds : list of float
        List of candidate split thresholds for the feature column.
    """
    # Extract the values of the feature column
    feature_column = feature_column.values
    # Get the total number of data points
    n_data = len(feature_column)

    # Sort the feature column in ascending order
    sorted_column = np.sort(feature_column)

    # Check if there is more than one data point
    if len(feature_column) > 1:
        # Split the sorted feature column into n_thresholds + 1 partitions
        partitioned_array = np.array_split(sorted_column, n_thresholds + 1)

        # Calculate the midpoint between consecutive partitions as candidate thresholds
        thresholds = [(partitioned_array[i][-1] + partitioned_array[i + 1][0]) / 2 for i in range(len(partitioned_array) - 1)]
    else:
        # If there is only one data point, use it as the threshold
        thresholds = [feature_column[0]]

    # Return the list of candidate split thresholds
    return thresholds

def most_informative_feature(train_data, label, class_list, n_thresholds):
    """
    Find the most informative feature and its corresponding threshold for splitting the dataset.

    Parameters
    ----------
    train_data : DataFrame
        The training dataset.
    label : str
        The name of the column representing the class labels.
    class_list : list of str
        List of possible values of the class labels.
    n_thresholds : int
        The number of thresholds to generate for each feature.

    Returns
    -------
    min_entropy_feature : str
        The name of the most informative feature.
    min_entropy_threshold : float
        The corresponding threshold for splitting the dataset based on the most informative feature.
    """
    # Get the list of features excluding the label
    feature_list = train_data.columns.drop(label)

    # Initialize variables to store the minimum entropy and corresponding feature and threshold
    min_entropy = float('inf')
    min_entropy_feature = None
    min_entropy_threshold = None

    # Iterate over each feature in the feature list
    for feature in feature_list:
        # Generate candidate split thresholds for the current feature
        thresholds = get_split_thresholds(train_data[feature], n_thresholds)

        # Iterate over each threshold
        for t in thresholds:
            # Conditional entropy after splitting on the current feature and threshold
            # (despite its name, information_gain returns an entropy, so lower is better)
            info_gain = information_gain(train_data, feature, label, class_list, t)

            # Keep the split with the lowest conditional entropy seen so far
            if info_gain < min_entropy:
                # Update the minimum entropy and corresponding feature and threshold
                min_entropy = info_gain
                min_entropy_feature = feature
                min_entropy_threshold = t

    # Return the most informative feature and its corresponding threshold
    return min_entropy_feature, min_entropy_threshold

def is_leaf(train_data, label):
    """
    Check if a node in a decision tree is a leaf node.

    Parameters
    ----------
    train_data : DataFrame
        The dataset associated with the current node.
    label : str
        The name of the column representing the class labels.

    Returns
    -------
    bool
        True if the node is a leaf node (contains only one class), False otherwise.
    """
    # Get the unique classes in the current node
    classes_in_node = np.unique(train_data[label])

    # The node is a leaf exactly when it contains a single class
    return len(classes_in_node) == 1
    
def leaf_class(train_data, label):
    """
    Determine the class of a leaf node in a decision tree.

    Parameters
    ----------
    train_data : DataFrame
        The dataset associated with the leaf node.
    label : str
        The name of the column representing the class labels.

    Returns
    -------
    leaf_class : str
        The class label assigned to the leaf node.
    """
    # Get the unique classes and their counts in the current leaf node
    class_list, count_class = np.unique(train_data[label], return_counts=True)

    # Find the index of the class with the highest count (most frequent class)
    idx = count_class.argmax()

    # Return the class label associated with the most frequent class in the leaf node
    return class_list[idx]

def make_tree(train_data, label, class_list, n_thresholds, cur_depth, min_samples, max_depth):
    """
    Recursively build a decision tree.

    Parameters
    ----------
    train_data : DataFrame
        The training dataset associated with the current node.
    label : str
        The name of the column representing the class labels.
    class_list : list of str
        List of possible values of the class labels.
    n_thresholds : int
        The number of thresholds to generate for each feature.
    cur_depth : int
        The current depth of the decision tree.
    min_samples : int
        The minimum number of samples required to split a node.
    max_depth : int
        The maximum depth of the decision tree.

    Returns
    -------
    tree : dict or str
        The constructed decision tree represented as a nested dictionary. If a leaf node, returns the class label.
    """
    # Check stopping conditions for creating a leaf node
    if is_leaf(train_data, label) or cur_depth >= max_depth or len(train_data) <= min_samples:
        return leaf_class(train_data, label)
    else:
        # Increment the current depth for the next level of recursion
        cur_depth += 1

        # Find the most informative feature and its corresponding threshold for splitting
        split_feature, split_threshold = most_informative_feature(train_data, label, class_list, n_thresholds)

        # Split the dataset into left and right subsets based on the feature and threshold
        left_rows, right_rows = split_dec_tree(train_data[split_feature].values, split_threshold)

        # Check if either subset is empty; if so, create a leaf node
        if len(left_rows) == 0 or len(right_rows) == 0:
            return leaf_class(train_data, label)
        else:
            # Build the subtree
            split_condition = "{} <= {}".format(split_feature, split_threshold)
            sub_tree = {split_condition: []}

            # Recursive calls for the left and right branches
            left_branch = make_tree(train_data.iloc[left_rows], label, class_list, n_thresholds, cur_depth, min_samples, max_depth)
            right_branch = make_tree(train_data.iloc[right_rows], label, class_list, n_thresholds, cur_depth, min_samples, max_depth)

            # Check if both branches result in the same leaf class; if so, make the subtree a leaf
            if left_branch == right_branch:
                sub_tree = left_branch
            else:
                # Grow the tree by adding left and right branches to the split condition
                sub_tree[split_condition].append(left_branch)
                sub_tree[split_condition].append(right_branch)

            return sub_tree
        
def id3(train_data_m, label, n_thresholds=1, min_samples=4, max_depth=5):
    """
    Build a decision tree using the ID3 algorithm.

    Parameters
    ----------
    train_data_m : DataFrame
        The training dataset.
    label : str
        The name of the column representing the class labels.
    n_thresholds : int, optional
        The number of thresholds to generate for each feature.
    min_samples : int, optional
        The minimum number of samples required to split a node.
    max_depth : int, optional
        The maximum depth of the decision tree.

    Returns
    -------
    tree : dict or str
        The constructed decision tree represented as a nested dictionary. If a leaf node, returns the class label.
    """
    # Create a copy of the training dataset
    train_data = train_data_m.copy()

    # Get the unique classes of the label
    class_list = train_data[label].unique()

    # Start the recursion by calling the make_tree function
    tree = make_tree(train_data, label, class_list, n_thresholds, 0, min_samples, max_depth)

    # Return the constructed decision tree
    return tree

def predict_dec_tree(test_point, tree):
    """
    Predict the class label for a given test point using a decision tree.

    Parameters
    ----------
    test_point : Series
        The test point for which the class label is predicted.
    tree : dict or str
        The decision tree used for prediction.

    Returns
    -------
    prediction : str
        The predicted class label for the test point.
    """
    # Base case: if the tree is a leaf node (a class label)
    if not isinstance(tree, dict):
        return tree

    # Recursive case: traverse the tree based on feature values
    question = list(tree.keys())[0]
    attribute, value = question.split(" <= ")

    # Check the condition and follow the appropriate branch
    if test_point[attribute] <= float(value):
        answer = tree[question][0]
    else:
        answer = tree[question][1]

    # Recursive call on the selected branch
    return predict_dec_tree(test_point, answer)

def evaluate_dec_tree(tree, test_data, label):
    """
    Evaluate the accuracy of a decision tree on a test dataset.

    Parameters
    ----------
    tree : dict or str
        The decision tree to be evaluated.
    test_data : DataFrame
        The test dataset.
    label : str
        The name of the column representing the class labels.

    Returns
    -------
    accuracy : float
        The accuracy of the decision tree on the test dataset.
    """
    correct_predict = 0
    wrong_predict = 0

    # Iterate over each row in the test dataset
    for index in tqdm(range(len(test_data.index))):
        # Predict the class label for the current test point
        result = predict_dec_tree(test_data.iloc[index], tree)

        # Check if the predicted value matches the expected value
        if result == test_data[label].iloc[index]:
            correct_predict += 1  # Increase correct count
        else:
            wrong_predict += 1  # Increase incorrect count

    # Calculate and return the accuracy
    accuracy = correct_predict / (correct_predict + wrong_predict)
    return accuracy

def cross_validate_id3(full_data, label, k=5, n_thresholds=1, min_samples=4, max_depth=5):
    """
    Perform k-fold cross-validation for the ID3 decision tree algorithm.

    Parameters
    ----------
    full_data : DataFrame
        The full dataset containing feature columns and the target label.
    label : str
        The column name representing the class labels.
    k : int, optional
        Number of folds for cross-validation (default is 5).
    n_thresholds : int, optional
        The number of thresholds to generate for each feature.
    min_samples : int, optional
        The minimum number of samples required to split a node.
    max_depth : int, optional
        The maximum depth of the decision tree.

    Returns
    -------
    best_tree : dict or str
        The best-performing decision tree model.
    best_accuracy : float
        The highest accuracy achieved across the k-folds.
    """

    kf = KFold(n_splits=k, shuffle=True, random_state=42)
    accuracies = []
    best_tree = None
    best_accuracy = 0

    for train_index, test_index in kf.split(full_data):
        train_data = full_data.iloc[train_index]
        test_data = full_data.iloc[test_index]

        # Train the decision tree using ID3
        tree = id3(train_data, label, n_thresholds, min_samples, max_depth)

        # Evaluate the decision tree
        accuracy = evaluate_dec_tree(tree, test_data, label)
        accuracies.append(accuracy)

        # Save the best tree model
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            best_tree = tree

    mean_accuracy = np.mean(accuracies)
    print(f"Mean Accuracy: {mean_accuracy:.4f}")
    print(f"Best Accuracy: {best_accuracy:.4f}")

    return best_tree, best_accuracy

In [None]:
tree = id3(data_scaled_train, 'targets')
evaluate_dec_tree(tree, data_scaled_test, 'targets')

In [None]:
cross_validate_id3(full_data_scaled, 'targets', k=5, n_thresholds=1, min_samples=4, max_depth=5)
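
The tree-building code above picks, for each node, the feature and threshold with the lowest split score. The helper `information_gain` is defined earlier in the notebook; the sketch below only assumes that it behaves like a weighted entropy of the two child nodes, which is what minimizing it suggests. The names `entropy` and `split_entropy` and the toy arrays are illustrative, not part of the notebook.

```python
import numpy as np


def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return float(-np.sum(probs * np.log2(probs)))


def split_entropy(values, labels, threshold):
    """Weighted entropy of the two children produced by `values <= threshold`."""
    values = np.asarray(values)
    labels = np.asarray(labels)
    left = labels[values <= threshold]
    right = labels[values > threshold]
    n = len(labels)
    score = 0.0
    for child in (left, right):
        # Empty children contribute nothing to the weighted sum
        if len(child) > 0:
            score += len(child) / n * entropy(child)
    return score


values = np.array([0.1, 0.4, 0.6, 0.9])
labels = np.array([0, 0, 1, 1])

print(entropy(labels))                     # entropy before splitting
print(split_entropy(values, labels, 0.5))  # threshold that separates the classes
print(split_entropy(values, labels, 0.2))  # a less informative threshold
```

A lower score means purer children, so a threshold that separates the classes completely scores zero.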


#### **3.3 Naive Bayes Classifier**  
- Assumes that the features are conditionally independent given the class.  
- Performed reasonably but had the lowest accuracy of the supervised models, likely because this assumption is too strong for these features.

The Gaussian Naive Bayes classifier reaches 0.8422 accuracy on the test set. Cross-validation gives:
- Mean Accuracy: 0.8382
- Best Accuracy: 0.8978

Parameters of the best model across the folds (`prior` holds log-probabilities; the last entry of each `mean` and `variance` row comes from the label column itself and is not used for prediction):

| Parameter      | Value |
|----------------|-------|
| n_labels       | 2 |
| unique_labels  | [0., 1.] |
| n_classes      | 2 |
| mean           | [[ 0.7461,  0.4655, -0.2144, -0.0412,  0. ], [-0.6915, -0.4813,  0.1274, -0.0640,  1. ]] |
| variance       | [[0.5237, 0.8054, 0.6522, 1.0811, 1e-9], [0.4388, 0.9101, 1.7591, 0.9921, 1e-9]] |
| prior          | [-0.6161, -0.7767] |
| Score          | 0.8978 |
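
To make the decision rule concrete, here is a minimal sketch that classifies a point from the parameters in the table above (first four entries of each `mean` and `variance` row, plus the log priors). It evaluates the Gaussian log-density directly rather than taking the log of the density, which is a deliberate variation that avoids underflow for points far from the mean. The names `log_gaussian` and `predict_one` are illustrative and do not appear in the notebook.

```python
import numpy as np

# Parameters copied from the table above (feature columns only)
means = np.array([[0.7461, 0.4655, -0.2144, -0.0412],
                  [-0.6915, -0.4813, 0.1274, -0.0640]])
variances = np.array([[0.5237, 0.8054, 0.6522, 1.0811],
                      [0.4388, 0.9101, 1.7591, 0.9921]])
log_priors = np.array([-0.6161, -0.7767])
class_labels = np.array([0, 1])


def log_gaussian(point, mean, variance):
    """Log of the Gaussian density, computed without exponentiating."""
    return -0.5 * np.log(2 * np.pi * variance) - (point - mean) ** 2 / (2 * variance)


def predict_one(point):
    """Return the class with the highest log prior plus summed log likelihood."""
    # Broadcasting: (n_classes, n_features) against (n_features,)
    log_likelihood = log_gaussian(point, means, variances).sum(axis=1)
    return class_labels[np.argmax(log_priors + log_likelihood)]


print(predict_one(means[0]))  # a point located at the class-0 mean
print(predict_one(means[1]))  # a point located at the class-1 mean
```

Summing log-densities over features is exactly where the independence assumption enters: each feature contributes its own one-dimensional Gaussian term.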


In [None]:
def prior_gauss_bayes(train_data, label):
    """
    Calculate the log prior probabilities for each class in the dataset.

    Parameters
    ----------
    train_data : DataFrame
        The training dataset.
    label : str
        The name of the column representing the class labels.

    Returns
    -------
    priors : array-like
        The log prior probabilities for each class.
    """
    # Calculate the prior probabilities for each class
    priors = train_data.groupby(by=label).apply(lambda x: len(x) / len(train_data))

    # Return the log of the prior probabilities as an array
    return np.log(priors).values


def mean_variance(train_data, label):
    """
    Calculate the mean and variance for each feature in the dataset, grouped by class.

    Parameters
    ----------
    train_data : DataFrame
        The training dataset.
    label : str
        The name of the column representing the class labels.

    Returns
    -------
    mean : array-like
        The mean values for each feature and class.
    variance : array-like
        The variance values for each feature and class.
    """
    # Calculate the mean values for each feature and class
    mean = train_data.groupby(by=label).apply(lambda x: x.mean(axis=0))

    # Calculate the variance values for each feature and class
    variance = train_data.groupby(by=label).apply(lambda x: x.var(axis=0))

    # Return the mean and variance as arrays
    return (mean.values, variance.values + 1e-9)


def gaussian_density(mean, variance, point):
    """
    Calculate the Gaussian probability density for a given point.

    Parameters
    ----------
    mean : array-like
        The mean values for each feature and class.
    variance : array-like
        The variance values for each feature and class.
    point : array-like
        The values of the features for a given point.

    Returns
    -------
    density : array-like
        The Gaussian probability density for the given point.
    """
    # Calculate the Gaussian probability density for each feature
    d = (1 / np.sqrt(2*np.pi*variance)) * np.exp((-(point - mean)**2) / (2*variance))

    # Return the density as an array
    return d


def train_gaussian_naive_bayes(train_data, label):
    """
    Train a Gaussian Naive Bayes classifier.

    Parameters
    ----------
    train_data : DataFrame
        The training dataset.
    label : str
        The name of the column representing the class labels.

    Returns
    -------
    model : dict
        A dictionary containing the parameters of the trained Gaussian Naive Bayes model.
    """
    # Calculate the mean and variance for each feature and class
    mean, variance = mean_variance(train_data, label)

    # Calculate the log prior probabilities for each class
    priors = prior_gauss_bayes(train_data, label)

    # Get the class labels in sorted order so that they line up with the
    # groupby-based rows of `mean`, `variance`, and `priors`
    unique_labels = np.sort(train_data[label].unique())
    n_labels = len(unique_labels)

    # Construct and return the Gaussian Naive Bayes model
    return {'n_labels': n_labels, 'unique_labels': unique_labels, 'n_classes': n_labels, 'mean': mean,
            'variance': variance, 'prior': priors}

def posterior_gauss_bayes(point, mean, variance, class_list, n_classes, n_feat):
    """
    Calculate the log posterior probabilities for each class given a data point.

    Parameters
    ----------
    point : array-like
        The values of the features for a given data point.
    mean : array-like
        The mean values for each feature and class.
    variance : array-like
        The variance values for each feature and class.
    class_list : array-like
        The unique class labels.
    n_classes : int
        The number of classes.
    n_feat : int
        The number of features.

    Returns
    -------
    posteriors : array-like
        The log posterior probabilities for each class.
    """
    posteriors = []
    for i in range(n_classes):
        posterior = 0
        for j in range(n_feat):
            posterior += np.log(gaussian_density(mean[i][j], variance[i][j], point[j]))
        posteriors.append(posterior)
    return posteriors


def predict_gauss_bayes(test_data, label, gaus_bayes):
    """
    Predict the class labels for a given test dataset using a trained Gaussian Naive Bayes model.

    Parameters
    ----------
    test_data : DataFrame
        The test dataset.
    label : str
        The name of the column representing the class labels.
    gaus_bayes : dict
        A dictionary containing the parameters of the trained Gaussian Naive Bayes model.

    Returns
    -------
    predictions : array-like
        The predicted class labels for the test dataset.
    """
    predictions = []
    n_feat = len(test_data.columns) - 1
    for i in range(len(test_data)):
        pr = gaus_bayes['prior']
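        # Pass the features as a NumPy array so that point[j] is positional
        # (the label is assumed to be the last column)
        post = posterior_gauss_bayes(test_data.iloc[i, :-1].values, gaus_bayes['mean'], gaus_bayes['variance'],
                                     gaus_bayes['unique_labels'], gaus_bayes['n_classes'], n_feat)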
        prob = pr + post
        max_prob_class_idx = np.argmax(prob)
        predictions.append(gaus_bayes['unique_labels'][max_prob_class_idx])
    return predictions


def evaluate_gaus_naive_bayes(test_data, label, gaus_bayes):
    """
    Evaluate the accuracy of a Gaussian Naive Bayes model on a test dataset.

    Parameters
    ----------
    test_data : DataFrame
        The test dataset.
    label : str
        The name of the column representing the class labels.
    gaus_bayes : dict
        A dictionary containing the parameters of the trained Gaussian Naive Bayes model.

    Returns
    -------
    accuracy : float
        The accuracy of the Gaussian Naive Bayes model on the test dataset.
    """
    gaus_pred = predict_gauss_bayes(test_data, label, gaus_bayes)
    correct_predict = 0
    wrong_predict = 0
    for index in tqdm(range(len(test_data.index))):
        if gaus_pred[index] == test_data[label].iloc[index]:
            correct_predict += 1
        else:
            wrong_predict += 1
    accuracy = correct_predict / (correct_predict + wrong_predict)
    return accuracy

def cross_validate_naive_bayes(full_data, label, k=5):
    """
    Perform k-fold cross-validation for Gaussian Naïve Bayes using the existing training and evaluation functions.

    Parameters
    ----------
    full_data : DataFrame
        The full dataset containing feature columns and the target label.
    label : str
        The column name representing the class labels.
    k : int, optional
        Number of folds for cross-validation (default is 5).

    Returns
    -------
    best_model : dict
        The best-performing Naïve Bayes model.
    best_accuracy : float
        The highest accuracy achieved.
    """

    kf = KFold(n_splits=k, shuffle=True, random_state=42)
    accuracies = []
    best_model = None
    best_accuracy = 0

    for train_index, test_index in kf.split(full_data):
        train_data = full_data.iloc[train_index]
        test_data = full_data.iloc[test_index]

        # Train the Gaussian Naïve Bayes model
        gaus_bayes = train_gaussian_naive_bayes(train_data, label)

        # Evaluate the model
        accuracy = evaluate_gaus_naive_bayes(test_data, label, gaus_bayes)
        accuracies.append(accuracy)

        # Save the best model
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            best_model = gaus_bayes

    print(f"Mean Accuracy: {np.mean(accuracies):.4f}")
    print(f"Best Accuracy: {best_accuracy:.4f}")
    return best_model, best_accuracy


In [None]:
gaus_bayes = train_gaussian_naive_bayes(data_scaled_train, 'targets')
evaluate_gaus_naive_bayes(data_scaled_test, 'targets', gaus_bayes)

In [None]:
full_data = pd.concat([data_scaled_test,data_scaled_train])
cross_validate_naive_bayes(full_data, 'targets', 5)


#### **3.4 k-Nearest Neighbors (k-NN)**  
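- Hyperparameter `k` was tuned via cross-validation.  
- Performed well but is computationally expensive at prediction time, since each test point is compared against every training point.

With 5 neighbors and p = 2 (which corresponds to the Euclidean distance), the test accuracy is 0.9900.

Cross-validation (where `p` only matters for the Minkowski distance) gives: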
- Best Hyperparameters: k=2, distance=euclidean, p=1
- Best Cross-Validation Accuracy: 0.9985
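
The cost mentioned above comes from computing one distance at a time in a Python loop. As a rough sketch (not the notebook's implementation), the same majority vote can be computed for all test points at once with `scipy.spatial.distance.cdist`. The function name `knn_predict_batch` and the toy arrays are made up for illustration.

```python
import numpy as np
from scipy.spatial.distance import cdist


def knn_predict_batch(train_X, train_y, test_X, n_neighbors=5, p=2):
    """Majority-vote k-NN for every row of test_X (Minkowski distance of order p)."""
    # distances[i, j] = distance between test point i and training point j
    distances = cdist(test_X, train_X, metric="minkowski", p=p)
    # Indices of the n_neighbors closest training points for each test point
    nearest = np.argsort(distances, axis=1)[:, :n_neighbors]
    predictions = []
    for neighbor_labels in train_y[nearest]:
        values, counts = np.unique(neighbor_labels, return_counts=True)
        predictions.append(values[np.argmax(counts)])
    return np.array(predictions)


# Toy data: two well-separated groups
train_X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
train_y = np.array([0, 0, 1, 1])
test_X = np.array([[0.2, 0.1], [5.1, 5.4]])

print(knn_predict_batch(train_X, train_y, test_X, n_neighbors=3, p=2))
```

As with the loop-based version, ties are resolved in favor of the smallest label, because `np.unique` returns sorted values and `np.argmax` takes the first maximum. This matters for even values of `k`.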


In [None]:
def distance_knn(point_one, point_two, dist, p):
    """
    Calculate the Euclidean or Minkowski distance between two points.

    Parameters
    ----------
    point_one : array-like
        Coordinates of the first point.
    point_two : array-like
        Coordinates of the second point.
    dist: str
        Allow to choose between Euclidean or Minkowski distance.
    p: int
        Order of the norm, only used with Minkowski distance.

    Returns
    -------
    float
        Euclidean or Minkowski distance between the two points.
    """
    if dist == 'euclidean':
        return euclidean(point_one, point_two)
    else:
        return minkowski(point_one, point_two, p=p)


def get_neighbors_knn(train_set, test_point, label_col, n_neighbors, dist, p):
    """
    Get the nearest neighbors of a test point in the training set.

    Parameters
    ----------
    train_set : array-like
        The training set containing data points.
    test_point : array-like
        The test point for which neighbors are to be found.
    label_col : array-like
        The labels corresponding to the training set.
    n_neighbors : int
        The number of neighbors to retrieve.
    dist: str
        Allow to choose between Euclidean or Minkowski distance.
    p: int
        Order of the norm, only used with Minkowski distance.

    Returns
    -------
    ordered_train : array-like
        The nearest neighbors in the training set.
    ordered_label : array-like
        The corresponding labels of the nearest neighbors.
    """
    # Calculate distances between the test point and all points in the training set
    # (stored under a new name so the `dist` metric argument is not overwritten)
    distances = np.array([distance_knn(train_point, test_point, dist, p) for train_point in train_set])
    # Get indices that would sort the distances in ascending order
    idx_dist = distances.argsort()
    # Order the training set and labels based on the sorted distances
    ordered_train = train_set[idx_dist, :]
    ordered_label = label_col[idx_dist]
    # Return the top n_neighbors neighbors and their labels
    return ordered_train[:n_neighbors], ordered_label[:n_neighbors]

def predict_knn(train_set, test_point, labels, n_neighbors, dist, p):
    """
    Predict the label of a test point using k-nearest neighbors.

    Parameters
    ----------
    train_set : array-like
        The training set containing data points.
    test_point : array-like
        The test point for which the label is to be predicted.
    labels : array-like
        The labels corresponding to the training set.
    n_neighbors : int
        The number of neighbors to consider for the prediction.
    dist: str
        Allow to choose between Euclidean or Minkowski distance.
    p: int
        Order of the norm, only used with Minkowski distance.

    Returns
    -------
    predicted_label : array-like
        The predicted label for the test point.
    """
    # Get the nearest neighbors and their labels
    neigh, neigh_label = get_neighbors_knn(train_set, test_point, labels, n_neighbors, dist, p)
    # Count occurrences of each label among the neighbors
    values, counts = np.unique(neigh_label, return_counts=True)
    # Find the label with the highest count (majority class)
    idx = np.argmax(counts)
    # Return the predicted label
    return values[idx]

def evaluate_knn(train_set, test_set, label, n_neighbors=2, dist='euclidean', p=2):
    """
    Evaluate the accuracy of k-nearest neighbors algorithm on a test set.

    Parameters
    ----------
    train_set : DataFrame
        The training dataset.
    test_set : DataFrame
        The test dataset.
    label : str
        The name of the column representing the class labels.
    n_neighbors : int, optional
        The number of neighbors to consider for the prediction. Default is 2.
    dist: str
        Allow to choose between Euclidean or Minkowski distance.
    p: int
        Order of the norm, only used with Minkowski distance.

    Returns
    -------
    accuracy : float
        The accuracy of the k-nearest neighbors algorithm on the test set.
    """
    # Initialize counters for correct and incorrect predictions
    correct_predict = 0
    wrong_predict = 0
    # Extract labels and features from the training and test sets
    
    train_labels = train_set[label].values
    train_set = train_set.drop(label, axis=1)
    
    test_labels = test_set[label].values
    test_set = test_set.drop(label, axis=1)
    # Iterate through each row in the test dataset
    for index in tqdm(range(len(test_set.index))):
        # Predict the class label for the current test row
        result = predict_knn(train_set.values, test_set.iloc[index].values, train_labels, n_neighbors, dist, p)
        # Check if the predicted value matches the actual value
        if result == test_labels[index]:
            # Increase the correct prediction count
            correct_predict += 1
        else:
            # Increase the incorrect prediction count
            wrong_predict += 1

    # Calculate and return the accuracy
    accuracy = correct_predict / (correct_predict + wrong_predict)
    return accuracy

In [None]:
knn_accuracy = evaluate_knn(data_scaled_train, data_scaled_test, 'targets', n_neighbors=5)
knn_accuracy

In [None]:
# Define hyperparameters to test
k_values = [1, 3, 5, 7, 9]
distance_metrics = ['euclidean', 'minkowski']
p_values = [1, 2]

# Number of folds for cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Store results
results = []

# Perform cross-validation for each combination of hyperparameters
for k in k_values:
    for dist in distance_metrics:
        for p in p_values:
            accuracies = [
                evaluate_knn(
                    full_data.iloc[train_idx],
                    full_data.iloc[val_idx],
                    label='targets',
                    n_neighbors=k,
                    dist=dist,
                    p=p
                )
                for train_idx, val_idx in kf.split(full_data)
            ]
            
            # Store average accuracy for the current hyperparameters
            results.append((k, dist, p, np.mean(accuracies)))

# Find best hyperparameter combination
best_params = max(results, key=lambda x: x[3])

# Print best parameters
print(f"Best Hyperparameters: k={best_params[0]}, distance={best_params[1]}, p={best_params[2]}")
print(f"Best Cross-Validation Accuracy: {best_params[3]:.4f}")

In [None]:
def evaluate_model(model, test_data, label, model_type="naive_bayes"):
    """
    Evaluate a trained model on the test set using accuracy, precision, recall, and F1-score.

    Parameters
    ----------
    model : dict or None
        The trained model (Decision Tree or Naïve Bayes). Not required for KNN.
    test_data : DataFrame
        The test dataset.
    label : str
        The column name representing the class labels.
    model_type : str
        Type of model ("naive_bayes", "decision_tree", "knn").

    Returns
    -------
    None (Prints the evaluation results).
    """

    if model_type == "naive_bayes":
        # Predict labels using Gaussian Naïve Bayes
        predictions = predict_gauss_bayes(test_data, label, model)

    elif model_type == "decision_tree":
        # Predict labels using Decision Tree
        predictions = [predict_dec_tree(row, model) for _, row in test_data.iterrows()]

    elif model_type == "knn":
        # Extract labels and features (the training data is the global data_scaled_train)
        train_labels = data_scaled_train[label].values
        train_features = data_scaled_train.drop(label, axis=1).values
        test_features = test_data.drop(label, axis=1).values

        # Predict labels for the test set using KNN with k=1 and Minkowski distance of order 1
        predictions = [
            predict_knn(train_features, test_features[i], train_labels, n_neighbors=1, dist='minkowski', p=1)
            for i in range(len(test_features))
        ]

    else:
        # Fail early instead of hitting an undefined `predictions` below
        raise ValueError(f"Unknown model_type: {model_type!r}")

    # True labels
    y_true = test_data[label].values

    # Print classification report
    print(f"Evaluation for {model_type.upper()}:")
    print(classification_report(y_true, predictions, digits=4))
    print("-" * 50)


# Train models
nb_model = train_gaussian_naive_bayes(data_scaled_train, 'targets')
dt_model = id3(data_scaled_train, 'targets')

# Evaluate each model
evaluate_model(nb_model, data_scaled_test, 'targets', model_type="naive_bayes")
evaluate_model(dt_model, data_scaled_test, 'targets', model_type="decision_tree")
evaluate_model(None, data_scaled_test, 'targets', model_type="knn")
# Logistic model
y_pred = log_reg.predict(X_test_scaled)
accuracy = accuracy_score(test_y.ravel(), y_pred)
print("Model Accuracy:", accuracy)


### **3.3 Performance Comparison**  
Analyzing **accuracy, precision, recall, and F1-score** on the test set for the supervised models.  

| Model                | Accuracy | Precision | Recall | F1-Score | Support |
|----------------------|----------|-----------|--------|----------|----------|
| Naive Bayes | 0.8422      | **0**: 0.8486 <br> **1**: 0.8329       | **0**: 0.8785 <br> **1**: 0.7945    | **0**: 0.8633 <br> **1**: 0.8132      |**0**: 568 <br> **1**: 433      |
| Decision Tree       | 0.9471      | **0**: 0.9813 <br> **1**: 0.9077       | **0**: 0.9243 <br> **1**: 0.9769    | **0**: 0.9519 <br> **1**: 0.9410      |**0**: 568 <br> **1**: 433      |
|    k-NN      | 0.9970      | **0**: 1.0000 <br> **1**: 0.9931       | **0**: 0.9947 <br> **1**: 1.0000    | **0**: 0.9974 <br> **1**: 0.9965      |**0**: 568 <br> **1**: 433      |
|  Logistic Regression  | 0.9770      | **0**: 0.9964 <br> **1**: 0.9535       | **0**: 0.9630 <br> **1**: 0.9954    | **0**: 0.9794 <br> **1**: 0.9740      |**0**: 568 <br> **1**: 433      |

1. k-NN achieves the highest accuracy (0.9970), with precision, recall, and F1-score above 0.99 for both classes, making it the best-performing model. However, k-NN can be sensitive to noisy data and is computationally expensive for large datasets.

2. Decision Tree also performs well (0.9471 accuracy) but is weaker than both k-NN and Logistic Regression. Its precision for class 1 (0.9077) and recall for class 0 (0.9243) are its weakest figures, and it may be prone to overfitting, depending on the depth of the tree.

3. Logistic Regression performs better than Decision Tree, with 0.9770 accuracy. Its precision for class 1 (0.9535) is lower than for class 0 (0.9964), while its recall for class 1 (0.9954) is higher than for class 0 (0.9630), so most of its errors are class-0 samples predicted as class 1.

4. Naive Bayes has the lowest accuracy (0.8422) among the models, with lower recall for class 1 (0.7945) than for class 0 (0.8785). This means it produces more false negatives for class 1, potentially due to its assumption of feature independence.
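
The per-class figures in the table all follow from confusion-matrix counts. The sketch below shows the arithmetic. The counts are an assumption: they were chosen to be consistent with the Naive Bayes row (supports of 568 and 433) and are not taken from the notebook output.

```python
import numpy as np

# Rows: true class, columns: predicted class (hypothetical counts)
cm = np.array([[499, 69],
               [89, 344]])

accuracy = np.trace(cm) / cm.sum()
precision = np.diag(cm) / cm.sum(axis=0)  # correct / predicted as that class
recall = np.diag(cm) / cm.sum(axis=1)     # correct / actually in that class
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy : {accuracy:.4f}")
for cls in range(cm.shape[0]):
    print(f"class {cls}: precision={precision[cls]:.4f} "
          f"recall={recall[cls]:.4f} f1={f1[cls]:.4f}")
```

Low recall for a class means many of its samples end up in the other column of its row, which is the false-negative pattern described for Naive Bayes.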


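Analyzing **accuracy, precision, recall, and F1-score** for the unsupervised models. The support column shows that k-means is scored on all 1372 samples and DBSCAN on the 890 samples not labelled as noise.

| Model                | Accuracy | Precision | Recall | F1-Score | Support |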
|----------------------|----------|-----------|--------|----------|----------|
|  k-means  | 0.44      | **0**: 0.50 <br> **1**: 0.38       | **0**: 0.46 <br> **1**: 0.42    | **0**: 0.48 <br> **1**: 0.40     |**0**: 762 <br> **1**: 610     |
|  k-means (full data)  | 0.56      | **0**: 0.61 <br> **1**: 0.50       | **0**: 0.55 <br> **1**: 0.57    | **0**: 0.58 <br> **1**: 0.53      |**0**: 762 <br> **1**: 610      |
|  DBSCAN (no Noise)  | 0.82      | **0**: 1.00 <br> **1**: 0.96       | **0**: 0.83 <br> **1**: 0.80    | **0**: 0.91 <br> **1**: 0.88      |**0**: 521 <br> **1**: 369      |


The supervised models (Naïve Bayes, Decision Tree, k-NN, Logistic Regression) achieve strong accuracy scores, ranging from 84.22% to 99.70%, while the clustering methods (k-means, k-means on the full data, DBSCAN without noise) perform considerably worse, with accuracy ranging from 44% to 82%.

This suggests that the supervised models are well suited to the classification task, while the clustering methods, which never see the labels, struggle to separate the two classes.
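
Cluster IDs are arbitrary, so scoring a clustering with classification metrics requires mapping each cluster to a class first. On a two-class problem, an accuracy below 0.5 (such as the 0.44 for k-means) can simply mean that the cluster IDs are swapped relative to the labels. The following is a minimal sketch of one common way to do the mapping, using `scipy.optimize.linear_sum_assignment`. The helper `align_cluster_labels` and the toy arrays are hypothetical, and this is not necessarily how the notebook computed its numbers.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def align_cluster_labels(y_true, cluster_ids):
    """Relabel clusters with the class assignment that maximizes agreement."""
    y_true = np.asarray(y_true)
    cluster_ids = np.asarray(cluster_ids)
    classes = np.unique(y_true)
    clusters = np.unique(cluster_ids)
    # contingency[i, j] = number of samples in cluster i with true class j
    contingency = np.array([
        [np.sum((cluster_ids == c) & (y_true == k)) for k in classes]
        for c in clusters
    ])
    rows, cols = linear_sum_assignment(contingency, maximize=True)
    mapping = {clusters[r]: classes[c] for r, c in zip(rows, cols)}
    # Clusters left unmatched (more clusters than classes) are marked with -1
    return np.array([mapping.get(c, -1) for c in cluster_ids])


y_true = np.array([0, 0, 0, 1, 1, 1])
cluster_ids = np.array([1, 1, 0, 0, 0, 0])  # IDs swapped relative to the classes

raw_accuracy = np.mean(cluster_ids == y_true)
aligned_accuracy = np.mean(align_cluster_labels(y_true, cluster_ids) == y_true)
print(raw_accuracy, aligned_accuracy)
```

For two clusters and two classes this amounts to taking the better of the two possible labelings, so the aligned accuracy is never below 0.5.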



---



## **4. Recommendations**  

1. **Feature Engineering:**  
   - Use polynomial features to capture non-linear relationships or kernel methods for better class separation.  
2. **Ensemble Methods:**  
   - Use **Random Forest** or **Gradient Boosting** for better generalization.
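
Both recommendations can be tried with a few lines of scikit-learn. The sketch below is only a starting point under stated assumptions: it runs on synthetic data from `make_classification` (four features, as in this dataset) instead of the notebook's variables, and the hyperparameters (`n_estimators=50`, `degree=2`) are arbitrary choices rather than tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the real data: 4 continuous features, 2 classes
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)

candidates = {
    # Polynomial features let a linear model capture non-linear boundaries
    "poly + logistic": make_pipeline(
        StandardScaler(),
        PolynomialFeatures(degree=2, include_bias=False),
        LogisticRegression(max_iter=1000),
    ),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "gradient boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

cv_scores = {}
for name, model in candidates.items():
    # 5-fold cross-validated accuracy for each candidate
    cv_scores[name] = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: mean accuracy = {cv_scores[name].mean():.4f}")
```

Swapping `X` and `y` for the scaled features and targets used elsewhere in this notebook would give figures that are directly comparable with the tables above.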



---
