# **Compare simple classifiers (Wine, Iris)**

---

# ***Abstract***
This study compares three machine learning classifiers, Decision Trees, Naive Bayes, and k-Nearest Neighbors (k-NN), on two datasets: Wine and Iris. Performance is measured by classification accuracy and classification error, with a confusion matrix computed for each model, under two training/testing splits: 80%/20% and 50%/50%. The number of neighbors for k-NN is selected by a 5-fold cross-validated grid search over k = 1 to 20 on standardized features. In the reported run, accuracies ranged from 82.0% to 100%: Naive Bayes and k-NN clearly outperformed the Decision Tree on Wine, while all three methods scored between 90.0% and 96.7% on Iris.



# ***Introduction***

Machine learning has transformed data analysis by enabling automated decision-making processes across various domains. Among the numerous algorithms available, Decision Trees, naïve Bayes, and k-Nearest Neighbors (k-NN) are frequently employed due to their interpretability and effectiveness in handling classification tasks. This research investigates the performance of these methods using two well-known datasets: Wine and Iris.





In [2]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn import tree
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from ucimlrepo import fetch_ucirepo
import pandas as pd
import numpy as np


In [6]:
!pip install ucimlrepo




In [7]:
# Class to hold classifier results
class Classifier:
    def __init__(self, dataset_name, train_size, test_size, classifier_name, parameters, accuracy, classification_error, confusion_matrix, report):
        self.dataset_name = dataset_name
        self.train_size = train_size
        self.test_size = test_size
        self.classifier_name = classifier_name
        self.parameters = parameters
        self.accuracy = accuracy
        self.classification_error = classification_error
        self.confusion_matrix = confusion_matrix
        self.report = report

# Fetch Iris and Wine datasets using ucimlrepo
iris = fetch_ucirepo(id=53)
wine = fetch_ucirepo(id=109)

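# Extract features (pandas DataFrames) and targets.
# The targets come back as single-column DataFrames; flattening them to 1-D arrays
# avoids scikit-learn's DataConversionWarning about a column-vector y.
datasets = {
    "Iris": (iris.data.features, iris.data.targets.values.ravel()),
    "Wine": (wine.data.features, wine.data.targets.values.ravel())
}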

# Function to train, test, and store classifier results
def evaluate_classifier(dataset_name, classifier_name, model, x_train, y_train, x_test, y_test, parameters):
    model.fit(x_train, y_train)  # Train the model
    y_pred = model.predict(x_test)  # Make predictions on test data
    accuracy = accuracy_score(y_test, y_pred)  # Calculate accuracy
    classification_error = 1 - accuracy  # Calculate classification error
    conf_matrix = confusion_matrix(y_test, y_pred)  # Confusion matrix
    report = classification_report(y_test, y_pred)  # Classification report
    total = len(y_train) + len(y_test)  # Total samples, used to report the split fractions
    return Classifier(dataset_name, len(y_train) / total, len(y_test) / total,
                      classifier_name, parameters, accuracy, classification_error, conf_matrix, report)

# List to store results of each classifier
classifiers_results = []

# Loop over both datasets
for dataset_name, (X, y) in datasets.items():
    # Split data for 50% training and 50% testing.
    # Note: no random_state is set, so each run draws different splits and the scores vary between runs.
    x_train50, x_test50, y_train50, y_test50 = train_test_split(X, y, test_size=0.5)
    # Split data for 80% training and 20% testing
    x_train80, x_test20, y_train80, y_test20 = train_test_split(X, y, test_size=0.2)

    # 1. Decision Tree Classifier
    dt_classifier = tree.DecisionTreeClassifier()
    classifiers_results.append(evaluate_classifier(dataset_name, "Decision Tree", dt_classifier, x_train50, y_train50, x_test50, y_test50, "Default"))
    classifiers_results.append(evaluate_classifier(dataset_name, "Decision Tree", dt_classifier, x_train80, y_train80, x_test20, y_test20, "Default"))

    # 2. Naive Bayes Classifier
    nb_classifier = GaussianNB()
    classifiers_results.append(evaluate_classifier(dataset_name, "Naive Bayes", nb_classifier, x_train50, y_train50, x_test50, y_test50, "Default"))
    classifiers_results.append(evaluate_classifier(dataset_name, "Naive Bayes", nb_classifier, x_train80, y_train80, x_test20, y_test20, "Default"))

    # 3. k-Nearest Neighbors (k-NN) Classifier with GridSearch for best k
    knn_pipeline = Pipeline(steps=[("scaler", StandardScaler()), ("knn", KNeighborsClassifier())])
    param_grid = {'knn__n_neighbors': range(1, 21)}  # Search for best k from 1 to 20
    grid_search = GridSearchCV(knn_pipeline, param_grid, cv=5)

    # Train k-NN with grid search on 50%-50% split
    grid_search.fit(x_train50, y_train50)
    best_k50 = grid_search.best_params_['knn__n_neighbors']
    classifiers_results.append(evaluate_classifier(dataset_name, "k-NN", grid_search.best_estimator_, x_train50, y_train50, x_test50, y_test50, f"k={best_k50}"))

    # Train k-NN with grid search on 80%-20% split
    grid_search.fit(x_train80, y_train80)
    best_k80 = grid_search.best_params_['knn__n_neighbors']
    classifiers_results.append(evaluate_classifier(dataset_name, "k-NN", grid_search.best_estimator_, x_train80, y_train80, x_test20, y_test20, f"k={best_k80}"))

# Create a dataframe to display the results in table form
df_results = pd.DataFrame([vars(c) for c in classifiers_results])
df_results.columns = ['Dataset', '% Training', '% Testing', 'Classifier', 'Parameters', 'Accuracy', 'Classification Error', 'Confusion Matrix', 'Classification Report']

# Display table of results
print(df_results[['Dataset', '% Training', '% Testing', 'Classifier', 'Parameters', 'Accuracy', 'Classification Error']])


   Dataset  % Training  % Testing     Classifier Parameters  Accuracy  Classification Error
0     Iris    0.500000   0.500000  Decision Tree    Default  0.946667              0.053333
1     Iris    0.800000   0.200000  Decision Tree    Default  0.966667              0.033333
2     Iris    0.500000   0.500000    Naive Bayes    Default  0.946667              0.053333
3     Iris    0.800000   0.200000    Naive Bayes    Default  0.966667              0.033333
4     Iris    0.500000   0.500000           k-NN       k=12  0.946667              0.053333
5     Iris    0.800000   0.200000           k-NN        k=3  0.900000              0.100000
6     Wine    0.500000   0.500000  Decision Tree    Default  0.820225              0.179775
7     Wine    0.797753   0.202247  Decision Tree    Default  0.888889              0.111111
8     Wine    0.500000   0.500000    Naive Bayes    Default  0.955056              0.044944
9     Wine    0.797753   0.202247    Naive Bayes    Default  1.000000              0.000000
10    Wine    0.500000   0.500000           k-NN        k=3  0.977528              0.022472
11    Wine    0.797753   0.202247           k-NN        k=5  0.944444              0.055556
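
The accuracy and classification error reported in the table are both summaries of the confusion matrix that `evaluate_classifier` stores for each model. As a minimal sketch of that relationship, the snippet below uses a made-up 3-class confusion matrix (the counts are illustrative, not taken from the run above) and a hypothetical helper named `summarize_confusion`, which is not part of the notebook:

```python
import numpy as np

def summarize_confusion(cm):
    """Return (accuracy, classification error, per-class recall).

    Rows are true classes and columns are predicted classes, which is the
    layout produced by sklearn.metrics.confusion_matrix.
    """
    cm = np.asarray(cm, dtype=float)
    correct = np.trace(cm)                   # diagonal = correctly classified samples
    total = cm.sum()                         # all test samples
    accuracy = correct / total
    error = 1.0 - accuracy                   # same definition as in evaluate_classifier
    recall = np.diag(cm) / cm.sum(axis=1)    # fraction of each true class recovered
    return accuracy, error, recall

# Illustrative counts only (hypothetical 30-sample test set with 3 classes)
example_cm = [[10, 0, 0],
              [0, 9, 1],
              [0, 1, 9]]

accuracy, error, recall = summarize_confusion(example_cm)
print(f"accuracy={accuracy:.4f}, error={error:.4f}, recall={recall}")
```

Per-class recall is included because two models with the same overall accuracy can still make their mistakes on different classes, which the single accuracy figure hides.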



# **Conclusion**

A comparison of the three classifiers (Decision Tree, Naive Bayes, and k-Nearest Neighbors) on the Iris and Wine datasets shows that Naive Bayes was the most consistent performer in terms of accuracy in the run reported above. On Iris, the Decision Tree and Naive Bayes tied at 94.67% with the 50/50 split and 96.67% with the 80/20 split. On Wine, Naive Bayes reached 95.51% with the 50/50 split and 100% with the 80/20 split, while the Decision Tree was the weakest method at 82.02% and 88.89%.

The k-NN method was effective but more variable. It achieved 94.67% (k=12) and 90.00% (k=3) on Iris, and 97.75% (k=3) and 94.44% (k=5) on Wine, the former being the best 50/50 result on that dataset. Unlike the other two classifiers, its accuracy dropped when the training share rose from 50% to 80%, which likely reflects the small test sets (30 and 36 samples) rather than a real degradation. Because the splits are drawn without a fixed random seed, these figures change from run to run, and differences of a few percentage points should not be over-interpreted.

These findings indicate that the choice of classifier may depend on the specific context and characteristics of the data set, and underline the importance of evaluating multiple algorithms to determine the most appropriate approach in classification tasks.
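
One way to check how much of the spread between classifiers comes from the random split itself is to repeat the evaluation over many seeded, stratified folds and report the mean and standard deviation of the accuracy. The following is a sketch under stated assumptions rather than the procedure used above: it loads scikit-learn's bundled copies of Iris and Wine instead of the `ucimlrepo` downloads, and it fixes k=5 for k-NN instead of tuning it with a grid search.

```python
from sklearn.datasets import load_iris, load_wine
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's bundled datasets stand in for the ucimlrepo downloads.
datasets = {"Iris": load_iris(return_X_y=True), "Wine": load_wine(return_X_y=True)}

# Assumption: k=5 is fixed here for brevity; the notebook tunes k with GridSearchCV.
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
    "k-NN (k=5)": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

# 5 folds repeated 10 times gives 50 accuracy estimates per model, with a fixed seed.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)

summary = {}
for dataset_name, (X, y) in datasets.items():
    for model_name, model in models.items():
        scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
        summary[(dataset_name, model_name)] = (scores.mean(), scores.std())
        print(f"{dataset_name:5s} {model_name:14s} mean={scores.mean():.3f} std={scores.std():.3f}")
```

If the standard deviations are comparable to the gaps between the mean accuracies, a ranking drawn from a single split is not reliable.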