In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score

def import_dataset(name):
    datasets = {
        "iris": "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
        "glass": "https://archive.ics.uci.edu/ml/machine-learning-databases/glass/glass.data",
        "balance-scale": "https://archive.ics.uci.edu/ml/machine-learning-databases/balance-scale/balance-scale.data",
        "heart-cleveland": "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data",
        "ecoli": "https://archive.ics.uci.edu/ml/machine-learning-databases/ecoli/ecoli.data"
    }
    
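    # Load the selected dataset. Each file has its own layout, so the feature
    # columns and the label column are picked per dataset.
    url = datasets[name]
    if name == "iris":
        # Four measurements followed by the species name
        df = pd.read_csv(url, header=None)
        X = df.iloc[:, :-1].values
        y = pd.Categorical(df.iloc[:, -1]).codes
    elif name == "glass":
        # First column is an ID number; last column is the glass type
        df = pd.read_csv(url, header=None)
        X = df.iloc[:, 1:-1].values
        y = df.iloc[:, -1].values
    elif name == "balance-scale":
        # Class label comes first, followed by the four attributes
        df = pd.read_csv(url, header=None)
        X = df.iloc[:, 1:].values
        y = pd.Categorical(df.iloc[:, 0]).codes
    elif name == "heart-cleveland":
        # Missing values are marked with "?"; rows containing them are dropped
        df = pd.read_csv(url, header=None, na_values="?")
        df.dropna(inplace=True)
        X = df.iloc[:, :-1].values
        y = pd.Categorical(df.iloc[:, -1]).codes
    elif name == "ecoli":
        # Whitespace-separated; first column is the sequence name
        df = pd.read_csv(url, sep=r'\s+', header=None)
        X = df.iloc[:, 1:-1].values
        y = pd.Categorical(df.iloc[:, -1]).codes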
    return X, y

# Apply KNN and K-Means to multiple datasets
datasets = ["iris", "glass", "balance-scale", "heart-cleveland", "ecoli"]
k = 3  # shared value: n_neighbors for KNN and n_clusters for K-Means

for dataset in datasets:
    print(f"Processing dataset: {dataset}")
    X, y = import_dataset(dataset)

    # Split the dataset
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # KNN Classifier
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    knn_predictions = knn.predict(X_test)
    knn_accuracy = accuracy_score(y_test, knn_predictions)
    print(f"KNN Accuracy for {dataset}: {knn_accuracy * 100:.2f}%")

    # K-Means Clustering
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X)
    kmeans_labels = kmeans.labels_

    # Map clusters to true labels for accuracy calculation
    cluster_mapping = {}
    for cluster_id in range(k):
        cluster_points = [y[i] for i in range(len(y)) if kmeans_labels[i] == cluster_id]
        if cluster_points:
            most_common_label = pd.Series(cluster_points).mode()[0]
            cluster_mapping[cluster_id] = most_common_label
    kmeans_predictions = [cluster_mapping[label] for label in kmeans_labels]
    kmeans_accuracy = accuracy_score(y, kmeans_predictions)
    print(f"K-Means Accuracy for {dataset}: {kmeans_accuracy * 100:.2f}%")
    print("===")

Processing dataset: iris
KNN Accuracy for iris: 100.00%
K-Means Accuracy for iris: 88.67%
===
Processing dataset: glass
KNN Accuracy for glass: 74.42%
K-Means Accuracy for glass: 49.53%
===
Processing dataset: balance-scale
KNN Accuracy for balance-scale: 78.40%
K-Means Accuracy for balance-scale: 64.16%
===
Processing dataset: heart-cleveland
KNN Accuracy for heart-cleveland: 48.33%
K-Means Accuracy for heart-cleveland: 53.87%
===
Processing dataset: ecoli
KNN Accuracy for ecoli: 86.76%
K-Means Accuracy for ecoli: 75.00%
===


## Q1: Names of all group members

Oscar Borén, oscbor-9@student.ltu.se  
Alexander Pettersson, aleepe-1@student.ltu.se

## Q2: Clear specification of the addressed grading criteria

"For grade 3: Develop 1 unsupervised and 1 supervised classification model for 5 datasets of your choice from 121 UCI datasets. Report accuracy results"

Supervised: K-Nearest Neighbors
Unsupervised: K-Means Clustering

Tested on iris, glass, balance-scale, heart-cleveland, and ecoli. For accuracy, see printout above.

## Q3: Description of the datasets used in the miniproject

====== 1. Iris Dataset ======  
A dataset often used in pattern recognition. It contains measurements of iris flowers from three different species.  
  
Input: Sepal length, sepal width, petal length, petal width.  
Classes: Setosa, versicolour, virginica.  
Size: 150 instances, 4 features.  
Majority class percentage: 33.3%.

  
====== 2. Glass Identification Dataset ======  
Used for studying classification of glass types based on chemical composition. The goal here is to predict the type of glass based on its oxide content.  

Input: Refractive index, various oxide contents (e.g., sodium, magnesium, etc.).  
Classes: Building windows (float-processed), building windows (non-float-processed), vehicle windows (float-processed), vehicle windows (non-float-processed; no instances in this dataset), containers, tableware, headlamps.  
Size: 214 instances, 9 features.  
Majority class percentage: 35.5%.

  
====== 3. Balance Scale Dataset ======  
Simulated data for balance scale weight and distance measurements. Used to predict the tilt direction of the scale.  

Input: Left weight, left distance, right weight, right distance.  
Classes: L (left tilt), B (balanced), R (right tilt).  
Size: 625 instances, 4 features.  
Majority class percentage: 46.1%.

  
====== 4. Heart Disease (Cleveland) Dataset ======  
A dataset designed to predict the presence of heart disease based on patient attributes. It contains clinical data and diagnostic results.  

Input: Age, sex, chest pain type, resting blood pressure, serum cholesterol, fasting blood sugar, resting ECG, maximum heart rate, etc.  
Classes: Diagnosis of heart disease (0: Absence, 1–4: Severity).  
Size: 303 instances (297 after the rows with missing values are dropped in the code above), 13 features.  
Majority class percentage: 54.1%.

  
====== 5. E. Coli Dataset ======  
A dataset for protein localization sites in cells of E. coli. It predicts the location of proteins based on features.  

Input: Sequence-based properties such as McGeoch's and von Heijne's signal sequence scores.  
Classes: Cytoplasm, inner membrane (no signal sequence), periplasm, outer membrane, etc.  
Size: 336 instances, 7 features.  
Majority class percentage: 42.6%.
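
The majority percentages listed above are the share of instances that belong to the most frequent class, which is also the accuracy of a classifier that always predicts that class. A minimal sketch of the calculation, assuming the labels are available as a plain sequence (the helper name `majority_percentage` is ours and not part of the notebook code):

```python
from collections import Counter

def majority_percentage(labels):
    """Return the share (in percent) of the most frequent label."""
    counts = Counter(labels)
    # most_common(1) returns [(label, count)] for the most frequent label
    _, top_count = counts.most_common(1)[0]
    return 100.0 * top_count / len(labels)

# Hypothetical example: three balanced classes of 50 instances each, as in iris
iris_like_labels = [0] * 50 + [1] * 50 + [2] * 50
print(f"{majority_percentage(iris_like_labels):.1f}%")  # prints 33.3%
```

This baseline is useful when reading the accuracy printout: a model that scores close to the majority percentage has learned little beyond the class distribution.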

## Q4: Description of the models used in the miniproject

K-Nearest Neighbors:  
    * Hyperparams.: n_neighbors.  
    * Search Space: n_neighbors is fixed at 3, which determines the number of nearest neighbors considered for classification. No other values were searched.  

K-Means:  
    * Hyperparams.: n_clusters and random_state.  
    * Search Space: n_clusters is fixed at 3 for every dataset, specifying the number of clusters to form.  
                    random_state is fixed at 42 to ensure reproducibility.  
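
Each hyperparameter was set to a single value, so no search was actually carried out. One way a search space could be explored for KNN is a small grid search. The sketch below is not part of the notebook: it assumes scikit-learn's bundled copy of iris (`load_iris`) in place of the UCI download, and the candidate values in `param_grid` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Assumption: scikit-learn's bundled iris copy stands in for the UCI download
X, y = load_iris(return_X_y=True)

# Hypothetical search space; the notebook itself only uses n_neighbors=3
param_grid = {"n_neighbors": [1, 3, 5, 7, 9]}

# Each candidate is scored with 5-fold cross-validated accuracy
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

best_k = search.best_params_["n_neighbors"]
best_score = search.best_score_
print(f"Best n_neighbors: {best_k} (mean CV accuracy {best_score * 100:.2f}%)")
```

The same pattern does not carry over directly to K-Means, since n_clusters cannot be tuned against accuracy without using the labels.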

## Q5: Description of the experimental methodology (datasets' splits, cross-validation, performance metrics, etc.)

Dataset Splits: Each dataset is split into training and testing subsets using an 80-20 split (test_size=0.2, random_state=42).

Cross-Validation: No cross-validation was performed. Instead, performance was assessed using a single train-test split. k-fold cross-validation could be used for a more reliable evaluation.
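
A minimal sketch of what k-fold cross-validation could look like for the KNN model. It is not part of the notebook: it assumes scikit-learn's bundled iris copy (`load_iris`) instead of the UCI download, and the choice of five stratified, shuffled folds is illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Assumption: scikit-learn's bundled iris copy stands in for the UCI download
X, y = load_iris(return_X_y=True)

# Five stratified folds keep the class proportions similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
knn = KNeighborsClassifier(n_neighbors=3)

# One accuracy value per fold; the mean and spread summarize the estimate
scores = cross_val_score(knn, X, y, cv=cv, scoring="accuracy")
print(f"Fold accuracies: {scores.round(3)}")
print(f"Mean accuracy: {scores.mean() * 100:.2f}% (std {scores.std() * 100:.2f}%)")
```

Reporting the mean and standard deviation over folds makes the result less dependent on one particular split. Datasets with very small classes, such as ecoli, may need fewer folds for stratification to work well.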

Performance Metrics:  
    * KNN: Accuracy Score was calculated on the test split as the ratio of correct predictions to the total number of predictions.  
    * K-Means: The model was fit on the full dataset without labels. Each cluster was then mapped to the most common true label among its points to approximate classification accuracy.  
               Accuracy Score was used again, comparing the mapped predictions with the true labels of the full dataset.  
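
The cluster-to-label mapping used for K-Means can be sketched as a standalone function. The function name `map_clusters_to_labels` and the toy arrays are ours, chosen for illustration; the logic follows the majority-vote mapping in the notebook code.

```python
import numpy as np

def map_clusters_to_labels(cluster_ids, true_labels):
    """Replace each cluster id with the most common true label in that cluster."""
    cluster_ids = np.asarray(cluster_ids)
    true_labels = np.asarray(true_labels)
    mapped = np.empty_like(true_labels)
    for cluster_id in np.unique(cluster_ids):
        members = cluster_ids == cluster_id
        labels, counts = np.unique(true_labels[members], return_counts=True)
        # argmax returns the first maximum, so ties go to the smallest label
        mapped[members] = labels[np.argmax(counts)]
    return mapped

# Hypothetical toy example: two clusters, six points
cluster_ids = [0, 0, 0, 1, 1, 1]
true_labels = [2, 2, 1, 0, 0, 0]
mapped = map_clusters_to_labels(cluster_ids, true_labels)
accuracy = float(np.mean(mapped == np.asarray(true_labels)))
print(mapped.tolist(), f"{accuracy:.3f}")  # prints [2, 2, 2, 0, 0, 0] 0.833
```

Because the true labels are used to name the clusters, this score can never fall below the majority class percentage of the dataset, so it should be read as an optimistic estimate and not as a test accuracy.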

## Q6: Description of the experimental results

KNN outperformed K-Means on four of the five datasets, which we think is partly due to its supervised nature. K-Means appears to perform better on simpler datasets like iris and to suffer on more complex ones like glass and heart-cleveland, although it may simply be less suited to this type of dataset than KNN. One thing that could have improved its performance is a different n_clusters: a fixed value of 3 matches the number of classes only for iris and balance-scale, so it may not have been optimal elsewhere. Cross-validation would also have given a more reliable accuracy estimate than the single train-test split we used.
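
The effect of n_clusters can be illustrated on synthetic data. The sketch below is not one of our experiments: it assumes five well-separated blobs generated with `make_blobs`, and compares a fixed n_clusters of 3 with n_clusters set to the number of classes. The helper `mapped_accuracy` is a hypothetical name for the majority-vote accuracy described in Q5.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def mapped_accuracy(cluster_ids, true_labels):
    """Accuracy after mapping each cluster to its most common true label."""
    correct = 0
    for cluster_id in np.unique(cluster_ids):
        _, counts = np.unique(true_labels[cluster_ids == cluster_id], return_counts=True)
        correct += counts.max()  # points that match the cluster's majority label
    return correct / len(true_labels)

# Assumption: synthetic data with five well-separated classes of 40 points each
centers = [[0, 0], [10, 0], [0, 10], [10, 10], [5, 20]]
X, y = make_blobs(n_samples=200, centers=centers, cluster_std=0.5, random_state=42)

results = {}
for n_clusters in (3, len(np.unique(y))):
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit(X)
    results[n_clusters] = mapped_accuracy(kmeans.labels_, y)
    print(f"n_clusters={n_clusters}: mapped accuracy {results[n_clusters] * 100:.2f}%")
```

With three clusters and five balanced classes, at most three classes can be predicted at all, which caps the mapped accuracy at 60% here. The reverse caution also applies: this score tends to rise as more clusters are added, so a higher value alone does not show that the clustering is better.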

## Q7: Conclusions

The results show that KNN gives better accuracy than K-Means on most of the chosen datasets (four of five). This is probably because it uses labeled data, which helps most when class boundaries are well separated (e.g. 100% accuracy on Iris). However, KNN doesn't perform as well when the data is noisy or the classes overlap, as seen in its lower accuracy on Heart-Cleveland (48.33%), the one dataset where K-Means scored higher (53.87%). K-Means showed varying performance, doing well when clusters aligned with true labels (e.g. 88.67% on Iris) but struggling with overlapping classes (e.g. 49.53% on Glass). Overall, KNN is better suited for tasks requiring precise classification, while K-Means is useful when labels are unavailable.