# Assignment 2: Classification and Evaluation

## Overview:
For this assignment, you will apply a number of classifiers to various datasets, and
explore various evaluation paradigms and analyze the impact of multiple parameters on the performance of the classifiers. You will then answer a number of conceptual
questions about the Naive Bayes classifier, K-nearest neighbors, and a number of baselines based on your observations. 
## Data Sets:
In this assignment, you will work with two datasets. These datasets are adapted from a UCI archive public dataset:

 - **Adult**: You predict whether an adult person earns less than 50K or 50K or more US dollar per year, based on various personal attributes like age or education level. More information can be found<a href="https://archive.ics.uci.edu/dataset/2/adult"> here </a>. 
 - **Student**: You predict a student’s final grade {A+, A, B, C, D, F} based on a number of personal and performance related attributes, such as school, parent’s education level, number of absences, etc. More information can be found<a href="https://archive.ics.uci.edu/dataset/697/predict+students+dropout+and+academic+success"> here </a>. 
 
More information about these datasets can be found in `readme.txt` file.

In [157]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import copy
import math
from collections import defaultdict, Counter
import random

In [158]:
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.naive_bayes import GaussianNB, BernoulliNB,CategoricalNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

In [159]:
import warnings

# ignore future warnings 
warnings.simplefilter(action='ignore', category=FutureWarning)

In [160]:
import sklearn
print('The scikit-learn version is {}.'.format(sklearn.__version__))

The scikit-learn version is 1.3.0.


## Question 1. Reading and Pre-processing [1.5 marks] 

**A)** First, you will read in the data using the `fileName` parameter into a pandas DataFrame. You will also need to input the list of numerical feature names `num_feat` to the function to make your pre-processing easier.

**B)** Second, you replace missing values denoted by `?` using the following two strategies: 

   * <b>Continuous features</b>: For each feature find the <b>average feature value</b> in the dataset 
   * <b>Categorical features</b>: For each feature find the <b>most frequent value</b> in the dataset  


**C)** Third, you will use one-hot encoding to convert all nominal (and ordinal) attributes to numeric. You can achieve this by either using `get_dummies()` from the pandas library or `OneHotEncoder()` from the scikit-learn library. The resulting dataset includes all originally numeric features as well as the one-hot encoded features that are now numeric, call this data `num_dataset`.

**D)** Fourth, you will use **equal-width** binning ( 4 bins ) to convert numerical features into categorical. You can achieve this by using `qcut()` from pandas library. The resulting dataset includes all originally categorical features as well as the discretized features that are now categorical, call this data `cat_dataset`.


In [113]:
# This function should read a csv file and return two pandas dataframes

def preprocess(fileName, numerical_features):
    ## read the csv file
    data = pd.read_csv(fileName, na_values='?')
    data = data.iloc[:, 1:]

    features = list(data.iloc[:,:-1].columns)
    
    # replace missing values with the most frequent for categorical features
    categorical_features = [feature for feature in features if feature not in numerical_features]

    for feature in categorical_features:
        most_freq_value = data[feature].mode()[0]
        data[feature].fillna(most_freq_value, inplace=True)
    
    # replace missing values with the average for numerical features
    for feature in numerical_features:
        average_value = data[feature].mean()
        data[feature].fillna(average_value, inplace=True)
    
    # convert categorical features to numeric using one-hot encoding
    num_dataset = pd.get_dummies(data, columns=categorical_features)
    label_column = num_dataset.pop('label')
    num_dataset['label'] = label_column

    num_bins = 4
    cat_dataset = data
    
    for feature in numerical_features:
        width_bin_range = num_dataset[feature].max() - num_dataset[feature].min()
        bin_width = width_bin_range / num_bins
        width_bin_edges = [num_dataset[feature].min() + i * bin_width for i in range(num_bins + 1)]
    
        cat_dataset[feature] = pd.cut(num_dataset[feature], bins=width_bin_edges, include_lowest=True, labels=False)
    
    return cat_dataset, num_dataset
    

In [114]:
## list of numeric features for adult dataset
adult_num = ['Age','fnlwgt','Education-num','Capital-gain','Capital-loss','Hours-per-week']

## generate the categorical and numerical adult datasets
adult_cat_dataset,adult_num_dataset = preprocess("datasets/adult.csv",adult_num)

## generate the categorical and numerical student datasets
student_cat_dataset,student_num_dataset = preprocess("datasets/student.csv",[])

In [115]:
student_dataset = pd.read_csv("datasets/student.csv", na_values='?')
adult_dataset = pd.read_csv("datasets/adult.csv", na_values='?')

#### Question 2 . Baseline methods and Discussion [4.5 marks]
**A)** For 10 rounds, use `train_test_split` to divide the processed `cat_dataset` into 80% train, 20% test . Set the `random_state` equal to the loop counter. For example in the loop
``` python 
for i in range(10):
```
make `random_state` equal to `i`. 
Use the splitted datasets to train and test the following models: **[1 mark]**

- Zero-R
- One-R
- Weighted Random 

Report the average accuracy over the 10 runs. 


In [161]:
from sklearn.dummy import DummyClassifier
import numpy as np
from collections import Counter

# Zero-R baseline
zero_r = DummyClassifier(strategy='most_frequent')

# One-R baseline
def find_best_feature(X, y):
    best_feature = None
    best_error = float('inf')
    
    for feature in range(X.shape[1]):
        unique_values = np.unique(X.iloc[:, feature])
        rules = {value: y[X.iloc[:, feature] == value].value_counts().idxmax() for value in unique_values}
        
        error = np.sum(y != np.array([rules[x[feature]] for x in X.values]))
        
        if error < best_error:
            best_feature, best_rules, best_error = feature, rules, error
            
    return best_feature, best_rules

def one_r_train(X_train, y_train):
    best_feature, best_rules = find_best_feature(X_train, y_train)
    return best_feature, best_rules

def one_r_predict(X_test, best_feature, best_rules, majority_class):
    predictions = [best_rules.get(x[best_feature], majority_class) for x in X_test.values]
    return np.array(predictions)


In [163]:
def baselines(cat_dataset):

    ZeroR_Acc_1 = []
    OneR_Acc_1 = []
    WRand_Acc_1 = []

    ## your code here
    x = cat_dataset.iloc[:,:-1]
    y = cat_dataset.iloc[:,-1]
    
    for i in range(10):
        x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=i)

        # Zero-R
        zero_r.fit(x_train, y_train)
        y_pred = zero_r.predict(x_test)
        ZeroR_Acc_1.append(accuracy_score(y_test, y_pred))

        # One-R
        best_feature, best_rules = one_r_train(x_train, y_train)
        majority_class = Counter(y_train).most_common(1)[0][0]
        one_r_predictions = one_r_predict(x_test, best_feature, best_rules, majority_class)
        OneR_Acc_1.append(accuracy_score(y_test, one_r_predictions))

        # Weighted Random
        values, counts = np.unique(y_train, return_counts=True)
        weights = counts / (sum(counts))
        weighted_random_predictions = random.choices(values, weights=weights, k=y_test.shape[0])
        WRand_Acc_1.append(accuracy_score(y_test, weighted_random_predictions))
    
    print("Accuracy of ZeroR:", np.mean(ZeroR_Acc_1).round(2))
    print("Accuracy of One-R:", np.mean(OneR_Acc_1).round(2))
    print("Accuracy of Weighted Random:", np.mean(WRand_Acc_1).round(2))
    
##Adult Dataset and Student Dataset results: 
print("Adult Dataset Baseline results:")
baselines(adult_cat_dataset)

print("Student Dataset Baseline results:")
baselines(student_cat_dataset)


Adult Dataset Baseline results:
Accuracy of ZeroR: 0.76
Accuracy of One-R: 0.77
Accuracy of Weighted Random: 0.63
Student Dataset Baseline results:
Accuracy of ZeroR: 0.3
Accuracy of One-R: 0.31
Accuracy of Weighted Random: 0.23


**B)** After comparing the performance of the different models on the classification task, please comment on any differences or lack of differences you observe between the baseline models and the datasets. **[1.5 marks]**</br>
*NOTE: You may need to compare other performance metrics of these models, such as precision and recall of each class label, to gain a better understanding of their performance. You can use the `classification_report` from `sklearn.metrics` for this matter and check the performance of the classifiers for one round.*


In [174]:
def evaluate_baselines(cat_dataset):
    def evaluate_one_round(y_true, y_pred, target_names):
        report = classification_report(y_true, y_pred, target_names=target_names, zero_division=0)
        print(report)
    
    x = cat_dataset.iloc[:,:-1]
    y = cat_dataset.iloc[:,-1]
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

    # Zero-R
    zero_r = DummyClassifier(strategy='most_frequent')
    zero_r.fit(x_train, y_train)

    # One-R
    best_feature, best_rules = one_r_train(x_train, y_train)
    majority_class = Counter(y_train).most_common(1)[0][0]

    # Weighted Random
    values, counts = np.unique(y_train, return_counts=True)
    weights = counts / (sum(counts))

    # Predictions for each model
    zero_r_predictions = zero_r.predict(x_test)
    one_r_predictions = one_r_predict(x_test, best_feature, best_rules, majority_class)
    weighted_random_predictions = random.choices(values, weights=weights, k=y_test.shape[0])

    unique_class_labels = y_train.unique()

    # Evaluate each model's performance using classification_report
    print("Zero-R Classifier Report:")
    evaluate_one_round(y_test, zero_r_predictions, unique_class_labels)

    print("One-R Classifier Report:")
    evaluate_one_round(y_test, one_r_predictions, unique_class_labels)

    print("Weighted Random Classifier Report:")
    evaluate_one_round(y_test, weighted_random_predictions, unique_class_labels)

print("Adult Dataset Classifier Reports:")
evaluate_baselines(adult_cat_dataset)
print("Student Dataset Classifier Reports:")
evaluate_baselines(student_cat_dataset)


Adult Dataset Classifier Reports:
Zero-R Classifier Report:
              precision    recall  f1-score   support

        >50K       0.77      1.00      0.87       385
       <=50K       0.00      0.00      0.00       115

    accuracy                           0.77       500
   macro avg       0.39      0.50      0.44       500
weighted avg       0.59      0.77      0.67       500

One-R Classifier Report:
              precision    recall  f1-score   support

        >50K       0.77      1.00      0.87       385
       <=50K       1.00      0.02      0.03       115

    accuracy                           0.77       500
   macro avg       0.89      0.51      0.45       500
weighted avg       0.83      0.77      0.68       500

Weighted Random Classifier Report:
              precision    recall  f1-score   support

        >50K       0.78      0.76      0.77       385
       <=50K       0.25      0.27      0.26       115

    accuracy                           0.65       500
   macro

Adult Dataset:
For the Zero-R classifier, it predicts the majority class (>50K) for all instances, resulting in high precision, recall and f1 score for the >50K class, but low values for the '<=50K'. This is expected since it's always predicting the dominant class.
The One-R classifier performs similarly to the Zero-R classifier in terms of precision and recall for the >50K class, with improvement in predictions for '<=50K' at is creates rules based on a single feature to differentiate between classes.
The Weighted Random classifier seems to make more balanced predictions, with similar precision and recall values for both classes. However, its overall accuracy is lower (0.65) compared to the other classifiers.

Student Dataset:
For the Zero-R classifier, it's again predicting the majority class (B), leading to high precision, recall and f1 score for that class. Other classes have very low precision and recall values due to this bias, with the models accuracy being 29%.
The One-R classifier shows an improvement over Zero-R for classes like B and D in terms of precision, recall and f1 score, but struggles with classes with fewer instances like F and A+ due to the simplicity of the rules.
The Weighted Random classifier, again has balanced predictions with similar precision and recall values for most classes. However, its overall accuracy is the lowest (0.18) among the three classifiers.


For both datasets, the baseline models (Zero-R, One-R, and Weighted Random) generally achieve low performance in terms of precision, recall, and F1-score for individual classes compared to more advanced models.
There is a lack of significant differences between the baseline models in terms of their performance within each dataset. This could be due to the simplicity of the models and the underlying patterns in the data that these models are unable to capture effectively.
The Weighted Random classifier tends to have more balanced predictions for class labels in both datasets compared to the Zero-R and One-R classifiers. However, its overall accuracy is generally lower.

**C)** Update your code for One-R so that you can inspect the feature that is most often selected in the 10 rounds of training and testing for each dataset. Write the classification rule using the best feature and its values for each dataset. **[1 mark]**</br>  

In [126]:
def one_r_train_with_tracking(X_train, y_train, selected_features):
    best_feature, best_rules = find_best_feature(X_train, y_train)
    selected_features[best_feature] += 1
    return best_feature, best_rules

def baselines_with_feature_tracking(cat_dataset):
    
    selected_features = defaultdict(int)
    
    x = cat_dataset.iloc[:,:-1]
    y = cat_dataset.iloc[:,-1]
    OneR_Acc_1 = []

    
    for i in range(10):
        x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=i)

        # One-R with feature tracking
        best_feature, best_rules = one_r_train_with_tracking(x_train, y_train, selected_features)
        majority_class = Counter(y_train).most_common(1)[0][0]
        one_r_predictions = one_r_predict(x_test, best_feature, best_rules, majority_class)
        OneR_Acc_1.append(accuracy_score(y_test, one_r_predictions))
    
    print("Accuracy of One-R:", np.mean(OneR_Acc_1).round(2))
    
    # Identify the most often selected feature and its frequency
    most_selected_feature = max(selected_features, key=selected_features.get)
    print("Most often selected feature:", x.columns[most_selected_feature])
    
    # Write classification rule using the best feature and its values
    unique_values = np.unique(x_train.iloc[:, most_selected_feature])
    rule = {value: best_rules[value] for value in unique_values}
    print("Classification rule:", rule)

##Adult Dataset and Student Dataset results: 
print("Adult Dataset Baseline results:")
baselines_with_feature_tracking(adult_cat_dataset)

print("Student Dataset Baseline results:")
baselines_with_feature_tracking(student_cat_dataset)

Adult Dataset Baseline results:
Accuracy of One-R: 0.77
Most often selected feature: Capital-gain
Classification rule: {0: '<=50K', 1: '>50K', 3: '>50K'}
Student Dataset Baseline results:
Accuracy of One-R: 0.31
Most often selected feature: Fedu
Classification rule: {'high': 'C', 'low': 'D', 'mid': 'D', 'none': 'D'}


**D)** For weighted random baseline applied to Adult dataset, what would the error rate converge to (Write a formula based on the prior probability of the dominant class, named `prior`, and the fraction of test samples belonging to the dominant class, `fraction`)? **[1 mark]**

The error rate for the weighted random baseline applied to the Adult dataset would converge to the error rate of the dominant class. The formula to calculate the error rate based on the prior probability of the dominant class (prior) and the fraction of test samples belonging to the dominant class (fraction) is given by:


Error Rate = (1 - prior) * fraction + prior * (1 - fraction)



Explanation:

The first term (1 - prior) * fraction represents the error rate when the model predicts the non-dominant class incorrectly (1 - prior) and it happens for a fraction of the test samples (fraction).

The second term prior * (1 - fraction) represents the error rate when the model predicts the dominant class incorrectly (prior) and it happens for the remaining fraction of the test samples (1 - fraction).

This formula takes into account the prior probability of the dominant class and the proportion of samples in the test set that belong to the dominant class. It gives an estimate of the error rate that the weighted random baseline would converge to when applied to the Adult dataset.

## Question 3. Naive Bayes models [5 marks]

**A)** Divide the `num_dataset` and `cat_dataset` into 80% train and 20% test splits for 10 rounds, set the `random_state` equal to the loop counter. Then, train and test the following models:

- Gaussian Naive Bayes
- Bernoulli Naive Bayes
- Categorical Naive Bayes 

You must use the input data that you believe is best suited for each model. Finally, report the average accuracy of the NB models over the 10 runs. **[1 mark]**

**Note: You may need to change your input format to be able to use sklearn's CategoricalNB.** 

In [193]:
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB, BernoulliNB, CategoricalNB
import numpy as np

def NB_models(num_dataset, cat_dataset):
    GNB_Acc_1 = []
    BNB_Acc_1 = []
    CNB_Acc_1 = []

    cat_dataset_X = cat_dataset.drop('label', axis=1)
    categorical_columns = cat_dataset_X.select_dtypes(include=['object']).columns.tolist()

    encoder = OrdinalEncoder()
    cat_dataset_X_encoded = cat_dataset_X.copy()
    cat_dataset_X_encoded[categorical_columns] = encoder.fit_transform(cat_dataset_X_encoded[categorical_columns])

    gnb = GaussianNB()
    bnb = BernoulliNB()
    min_categories = cat_dataset_X_encoded.nunique()
    cnb = CategoricalNB(min_categories=min_categories)
    
    for i in range(10):

        X_train_cat, X_test_cat, y_train_cat, y_test_cat = train_test_split(cat_dataset_X_encoded, cat_dataset['label'], test_size=0.2, random_state=i)
        X_train_num, X_test_num, y_train_num, y_test_num = train_test_split(num_dataset.iloc[:, :-1], num_dataset['label'], test_size=0.2, random_state=i)

        # Gaussian Naive Bayes
        gnb.fit(X_train_num, y_train_num)
        y_pred = gnb.predict(X_test_num)
        GNB_Acc_1.append(accuracy_score(y_test_num, y_pred))

        # Bernoulli Naive Bayes
        bnb.fit(X_train_num, y_train_num)
        y_pred = bnb.predict(X_test_num)
        BNB_Acc_1.append(accuracy_score(y_test_num, y_pred))

        # Categorical Naive Bayes
        cnb.fit(X_train_cat, y_train_cat)
        y_pred = cnb.predict(X_test_cat)
        CNB_Acc_1.append(accuracy_score(y_test_cat, y_pred))

    print("Accuracy of GNB:", np.mean(GNB_Acc_1).round(2))
    print("Accuracy of BNB:", np.mean(BNB_Acc_1).round(2))
    print("Accuracy of CNB:", np.mean(CNB_Acc_1).round(2))

print("Adult Dataset NB results:")
NB_models(adult_num_dataset, adult_cat_dataset)
print("\n")
print("Student Dataset NB results:")
NB_models(student_num_dataset, student_cat_dataset)


Adult Dataset NB results:
Accuracy of GNB: 0.8
Accuracy of BNB: 0.79
Accuracy of CNB: 0.8


Student Dataset NB results:
Accuracy of GNB: 0.17
Accuracy of BNB: 0.32
Accuracy of CNB: 0.34


In [194]:
from sklearn.metrics import classification_report

def NB_models_classification(num_dataset, cat_dataset):

    cat_dataset_X = cat_dataset.drop('label', axis=1)
    categorical_columns = cat_dataset_X.select_dtypes(include=['object']).columns.tolist()

    encoder = OrdinalEncoder()
    cat_dataset_X_encoded = cat_dataset_X.copy()
    cat_dataset_X_encoded[categorical_columns] = encoder.fit_transform(cat_dataset_X_encoded[categorical_columns])

    gnb = GaussianNB()
    bnb = BernoulliNB()
    min_categories = cat_dataset_X_encoded.nunique()
    cnb = CategoricalNB(min_categories=min_categories)

    X_train_cat, X_test_cat, y_train_cat, y_test_cat = train_test_split(cat_dataset_X_encoded, cat_dataset['label'], test_size=0.2, random_state=0)
    X_train_num, X_test_num, y_train_num, y_test_num = train_test_split(num_dataset.iloc[:, :-1], num_dataset['label'], test_size=0.2, random_state=0)


    # Gaussian Naive Bayes
    gnb.fit(X_train_num, y_train_num)
    y_pred_gnb = gnb.predict(X_test_num)
    print("Classification Report for Gaussian NB:")
    print(classification_report(y_test_num, y_pred_gnb))

    # Bernoulli Naive Bayes
    bnb.fit(X_train_num, y_train_num)
    y_pred_bnb = bnb.predict(X_test_num)
    print("Classification Report for Bernoulli NB:")
    print(classification_report(y_test_num, y_pred_bnb))

    # Categorical Naive Bayes
    cnb.fit(X_train_cat, y_train_cat)
    y_pred_cnb = cnb.predict(X_test_cat)
    print("Classification Report for Categorical NB:")
    print(classification_report(y_test_cat, y_pred_cnb))

# Call the function for the Adult and Student datasets
print("Adult Dataset NB results:")
NB_models_classification(adult_num_dataset, adult_cat_dataset)

print("Student Dataset NB results:")
NB_models_classification(student_num_dataset, student_cat_dataset)


Adult Dataset NB results:
Classification Report for Gaussian NB:
              precision    recall  f1-score   support

       <=50K       0.82      0.95      0.88       385
        >50K       0.64      0.32      0.43       115

    accuracy                           0.80       500
   macro avg       0.73      0.63      0.65       500
weighted avg       0.78      0.80      0.78       500

Classification Report for Bernoulli NB:
              precision    recall  f1-score   support

       <=50K       0.81      0.96      0.88       385
        >50K       0.66      0.25      0.36       115

    accuracy                           0.80       500
   macro avg       0.74      0.61      0.62       500
weighted avg       0.78      0.80      0.76       500

Classification Report for Categorical NB:
              precision    recall  f1-score   support

       <=50K       0.83      0.92      0.88       385
        >50K       0.60      0.37      0.46       115

    accuracy                       

**B)** How does the performance of the Naive Bayes classifiers compare against your baseline models for each dataset? **[1 mark]** Please comment on any differences you observe between the baseline models and the NB models in the context of the two datasets.</br> *NOTE: You may need to compare other performance metrics of these models, such as precision and recall of each class label, to gain a better understanding of their performance. You can use the `classification_report` from `sklearn.metrics` for this matter and check the performance of the classifiers for one round.*

Adult Dataset:

Zero-R Classifier: This classifier predicts the majority class (">50K") for all instances. It achieves an accuracy of 0.77, but its F1-score for the "<=50K" class is extremely low (0.00).
One-R Classifier: This classifier uses a simple rule-based approach and achieves similar accuracy (0.77) but has a low F1-score for the "<=50K" class (0.03).
Weighted Random Classifier: This random classifier performs the worst with an accuracy of 0.65.

Gaussian NB Classifier: The Gaussian NB model performs notably better than the baselines, achieving an accuracy of 0.80. It provides improved F1-scores for both classes, especially for the ">50K" class (0.43).
Bernoulli NB Classifier: The Bernoulli NB model also achieves an accuracy of 0.80 and improves the F1-scores for both classes compared to the baselines.
Categorical NB Classifier: The Categorical NB model, similar to Gaussian and Bernoulli NB, attains an accuracy of 0.80 and enhances the F1-scores for both classes.


Student Dataset:

Zero-R Classifier: This classifier predicts the majority class ("B") for all instances and achieves an accuracy of 0.29.
One-R Classifier: This classifier, like Zero-R, performs poorly with an accuracy of 0.28.
Weighted Random Classifier: The random classifier performs slightly better with an accuracy of 0.18.

Gaussian NB Classifier: The Gaussian NB model significantly improves the accuracy to 0.12 compared to the baselines, but the F1-scores remain relatively low for most classes.
Bernoulli NB Classifier: The Bernoulli NB model also shows improved accuracy (0.26), but like Gaussian NB, the F1-scores are relatively low for many classes.
Categorical NB Classifier: Similar to Gaussian and Bernoulli NB, the Categorical NB model achieves an accuracy of 0.29, with limited improvement in F1-scores.

Differences Observed:

In both datasets, the Naive Bayes models consistently outperform the baselines in terms of accuracy.
The Naive Bayes models generally provide better F1-scores for both classes in the Adult dataset indicating improved precision and recall.
The Student Dataset presents a more challenging problem, and even the Naive Bayes models struggle to achieve high F1-scores for most classes. However, they still offer some improvement over the baselines.

**C)** The three Naive Bayes (NB) classifiers lead to different performances. Which of these NB classifiers performs best for each dataset, and why do you think it is the case? **[1 mark]** *NOTE: You may need to compare other performance metrics of these models, such as precision and recall of each class label, to gain a better understanding of their performance. You can use the `classification_report` from `sklearn.metrics` for this matter and check the performance of the classifiers for one round.*




Adult Dataset:

Gaussian NB Classifier performs the best for the Adult Dataset. It achieves an accuracy of 0.80, the highest among the NB models. It also provides reasonably balanced F1-scores for both classes ("<=50K" and ">50K").
Bernoulli NB Classifier also performs well with an accuracy of 0.80 but has slightly lower F1-scores compared to Gaussian NB.
Categorical NB Classifier has an accuracy of 0.80, but has slightly lower F1-scores than Gaussian and Bernoulli NB for both classes.

Student Dataset:

Gaussian NB Classifier has an accuracy of 0.12 and performs poorly with low precision, recall, and F1-scores for most classes, indicating challenges in correctly classifying various categories. Bernoulli NB Classifier performs better with an accuracy of 0.26 but still faces difficulties in accurately classifying several classes. Categorical NB Classifier has an accuracy of 0.29 but struggles to accurately classify multiple classes, leading to moderate precision, recall, and F1-scores for various categories.

Some possible reasons for this are:

Feature Independence Assumption:
All NB classifiers make the naive assumption of feature independence, which may or may not hold true in the real-world data. The suitability of this assumption can impact performance.

Data Complexity:
The complexity and distribution of data in the Student Dataset are quite different from the Adult Dataset. The Student Dataset appears to be more challenging, with a larger number of classes and potentially non-linear relationships between features and the target variable.

**D)** The Gaussian Naive Bayes classifier makes two fundamental assumptions: (1) about the distribution of $P(x_j|c_i)$ and (2) about the (conditional) dependency structure between features.
Explain both assumptions, and discuss whether these assumptions are always true for the
numeric attributes in the Adult dataset. If applicable, identify some cases where the assumptions are violated. **[2 marks]**


The Gaussian Naive Bayes (GNB) classifier makes two fundamental assumptions:

Distribution of P(xj|ci): GNB assumes that the numerical attributes (xj) of each class (ci) follow a Gaussian (normal) distribution.

Conditional Dependency Structure between features: GNB assumes that the features are conditionally independent given the class label, which means that the value of any one feature does not depend on the values of other features, given the class label. 

P(x1, x2, ..., xn|ci) = P(x1|ci) * P(x2|ci) * ... * P(xn|ci).


The assumptions of GNB may not always hold true for the numeric attributes in the Adult dataset:

Distribution Assumption (Gaussian Distribution):While some features might have distributions close to Gaussian, others could have more complex or skewed distributions.
For example, attributes like "capital-gain" and "capital-loss" are likely to have distributions with most values being zeros or near zero and a few extremely high values. 


Conditional Independence Assumption:The assumption of conditional independence might be violated in cases where there are strong correlations or relationships between numeric attributes. 
For example, Income ("income") might be strongly correlated with education level ("education-num") or occupation ("occupation").


In summary, while GNB's assumptions are simplifications that can work well for certain datasets, the numeric attributes in the Adult dataset include features violate these assumptions, and therefore GNB may not be suitable for this dataset.

## Question 4. K-Nearest Neighbor [3 marks] 
**A)** Divide the `num_dataset` into 80% train and 20% test splits for 10 rounds, set the `random_state` equal to the loop counter. Then, train and test the following models:

- 6 K-Nearest Neighbor models with Euclidean distance using the following parameters:

    - with K values of 1,5, and 10
    
    - using inverse distance weighting and majority voting 

Finally, report the average accuracy of the KNN models over the 10 rounds. **[1 mark]**

In [188]:
def KNNs(num_dataset):
    KNN1_Acc_1_weighted = []
    KNN5_Acc_1_weighted = []
    KNN10_Acc_1_weighted = []
    KNN1_Acc_1_majority = []
    KNN5_Acc_1_majority = []
    KNN10_Acc_1_majority = []


    ## your code here
    for i in range(10):
        X_train, X_test, y_train, y_test = train_test_split(num_dataset.iloc[:,:-1], num_dataset.iloc[:,-1], test_size=0.2, random_state=i)
        
        for k in [1, 5, 10]:
            # Weighted KNN with inverse distance weighting
            knn_weighted = KNeighborsClassifier(n_neighbors=k, weights='distance', metric='euclidean')
            knn_weighted.fit(X_train.values, y_train)
            y_pred = knn_weighted.predict(X_test.values)
            acc_weighted = accuracy_score(y_test, y_pred)
            
            # KNN with majority voting
            knn_majority = KNeighborsClassifier(n_neighbors=k, weights='uniform', metric='euclidean')
            knn_majority.fit(X_train.values, y_train)
            y_pred = knn_majority.predict(X_test.values)
            acc_majority = accuracy_score(y_test, y_pred)

            if k == 1:
                KNN1_Acc_1_weighted.append(acc_weighted)
                KNN1_Acc_1_majority.append(acc_majority)
            elif k == 5:
                KNN5_Acc_1_weighted.append(acc_weighted)
                KNN5_Acc_1_majority.append(acc_majority)
            elif k == 10:
                KNN10_Acc_1_weighted.append(acc_weighted)
                KNN10_Acc_1_majority.append(acc_majority)

            
    print("Accuracy of weighted KNN(1):", np.mean(KNN1_Acc_1_weighted).round(2))
    print("Accuracy of weighted KNN(5):", np.mean(KNN5_Acc_1_weighted).round(2))
    print("Accuracy of weighted KNN(10):", np.mean(KNN10_Acc_1_weighted).round(2))
    print("Accuracy of KNN(1):", np.mean(KNN1_Acc_1_majority).round(2))
    print("Accuracy of KNN(5):", np.mean(KNN5_Acc_1_majority).round(2))
    print("Accuracy of KNN(10):", np.mean(KNN10_Acc_1_majority).round(2))
    
    
##Adult Dataset and Student Dataset results: 

print("Adult Dataset KNN results:")
KNNs(adult_num_dataset)
print("\n")
print("Student Dataset KNN results:")
KNNs(student_num_dataset)

  
    

Adult Dataset KNN results:
Accuracy of weighted KNN(1): 0.7
Accuracy of weighted KNN(5): 0.74
Accuracy of weighted KNN(10): 0.76
Accuracy of KNN(1): 0.7
Accuracy of KNN(5): 0.78
Accuracy of KNN(10): 0.79


Student Dataset KNN results:
Accuracy of weighted KNN(1): 0.26
Accuracy of weighted KNN(5): 0.28
Accuracy of weighted KNN(10): 0.29
Accuracy of KNN(1): 0.26
Accuracy of KNN(5): 0.27
Accuracy of KNN(10): 0.28


**B)** Compare the results of the weighted and majority KNN models (for each value of K) and explain any differences you observe for each dataset in terms of the voting strategy and the number of nearest neighbors. **[1 marks]**</br> *NOTE: You may need to compare other performance metrics of these models, such as precision and recall of each class label, to gain a better understanding of their performance. You can use the `classification_report` from `sklearn.metrics` for this matter and check the performance of the classifiers for one round.* 

*Answer Here*

In [192]:
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def knn_classifier_report(num_dataset):
    X_train, X_test, y_train, y_test = train_test_split(num_dataset.iloc[:,:-1], num_dataset.iloc[:,-1], test_size=0.2, random_state=1)

    for k in [1, 5, 10]:

        knn_weighted = KNeighborsClassifier(n_neighbors=k, weights='distance', metric='euclidean')
        knn_weighted.fit(X_train.values, y_train)
        y_pred_weighted = knn_weighted.predict(X_test.values)

        knn_majority = KNeighborsClassifier(n_neighbors=k, weights='uniform', metric='euclidean')
        knn_majority.fit(X_train.values, y_train)
        y_pred_majority = knn_majority.predict(X_test.values)

        print(f"Classification Report for Weighted KNN (K={k}):")
        print(classification_report(y_test, y_pred_weighted, zero_division=0))

        print(f"Classification Report for Majority KNN (K={k}):")
        print(classification_report(y_test, y_pred_majority, zero_division=0))

knn_classifier_report(adult_num_dataset)
knn_classifier_report(student_num_dataset)


Classification Report for Weighted KNN (K=1):
              precision    recall  f1-score   support

       <=50K       0.79      0.81      0.80       381
        >50K       0.35      0.33      0.34       119

    accuracy                           0.70       500
   macro avg       0.57      0.57      0.57       500
weighted avg       0.69      0.70      0.69       500

Classification Report for Majority KNN (K=1):
              precision    recall  f1-score   support

       <=50K       0.79      0.81      0.80       381
        >50K       0.35      0.33      0.34       119

    accuracy                           0.70       500
   macro avg       0.57      0.57      0.57       500
weighted avg       0.69      0.70      0.69       500

Classification Report for Weighted KNN (K=5):
              precision    recall  f1-score   support

       <=50K       0.79      0.88      0.83       381
        >50K       0.39      0.25      0.31       119

    accuracy                           0.73 

Adult Dataset:

K=1: Both Weighted KNN and Majority KNN have the same accuracy (0.7), therefore no difference with the voting strategy.

K=5: Weighted KNN has a slightly higher accuracy (0.78) compared to Majority KNN (0.73).

K=10: Weighted KNN has a slightly higher accuracy (0.79) compared to Majority KNN (0.78)

The majority KNN tends to perform slightly better in terms of accuracy across different values of K. This suggests that the majority voting strategy is more suitable for this dataset, where the majority class ("<=50K") is more prevalent.

As the number of nearest neighbors (K) increases, both models show a trend of increasing accuracy, with the majority KNN consistently having a slight advantage in accuracy. This aligns with the concept that increasing K tends to reduce the impact of noise on predictions.


Student Dataset:

K=1: Both Weighted KNN and Majority KNN have the same accuracy (0.26), therefore no difference with the voting strategy.

K=5: Weighted KNN has a slightly higher accuracy (0.28) compared to Majority KNN (0.27)

K=10: Weighted KNN has a slightly higher accuracy (0.29) compared to Majority KNN (0.28)

Both weighted and majority KNN models have similar accuracy values, with KNN models with inverse distance weighting performing slightly better. This indicates that the voting strategy has less of an impact on the performance with both models struggle to achieve high accuracy, which suggests that the dataset's characteristics pose challenges for both voting strategies.

Similar to the Adult Dataset, increasing the number of nearest neighbors generally leads to improved accuracy for both models. However, the improvements are relatively small.

**C)** How would standardisation impact the performance of your KNN models and Gaussian Naive Bayes model for the Adult dataset? **[1 marks]**

K-Nearest Neighbor (KNN) Models:
Standardization can have a significant impact on the performance of KNN models, particularly when using distance-based metrics like Euclidean distance. Standardization rescales the features so that they have a mean of zero and a standard deviation of one. This transformation ensures that all features are on the same scale, preventing features with larger magnitudes from dominating the distance calculations. The impact of standardization on KNN with majority voting might be less pronounced cas it does not consider the distance between neighbors. However, standardization can still help ensure that features contribute equally to the classification process, potentially leading to a more balanced and accurate model.


Gaussian Naive Bayes (GNB) Model:
GNB models assume that features follow a Gaussian distribution and standardization could help align the features with this Gaussian distribution assumption. This helps the GNB model perform better because it aligns with how the data is expected to be distributed.

Therefore, standardization can have positive effects on the performance of KNN models and the GNB model for the Adult dataset. It can contribute to improved accuracy, precision, and recall by ensuring balanced feature scales and aligning with the assumptions of the algorithms.

## Question 5. Evaluation metrics [2 marks]

**A)** Update the code in questions 2, 3, and 4 to compute the following metrics for the models listed below:

- One-R 
- Gaussian NB 
- Categorical NB
- 3-Nearest Neighbor model with Euclidean distance and majority voting 

Report their performance using the following two metrics
- micro-averaged precision
- macro-averaged precision 
 
Conversely, you can also choose to implement the same 10 rounds of train and test split (80% train, 20% test) as described in the questions 2,3, and 4 in the code block below and report the average scores for the micro-precision and macro-precision.

**[0.5 marks]**

In [187]:
from sklearn.metrics import precision_score, make_scorer

def compare_eval(num_dataset, cat_dataset):

    OneR_microP_1 = []
    GNB_microP_1 = []
    CNB_microP_1 = []
    KNN3_microP_1_majority = []

    OneR_macroP_1 = []
    GNB_macroP_1 = []
    CNB_macroP_1 = []
    KNN3_macroP_1_majority = []

    OneR_Acc_1 = []
    GNB_Acc_1 = []
    CNB_Acc_1 = []
    KNN3_Acc_1_majority = []
    
    for i in range(10):

        X = cat_dataset.drop('label', axis=1)
        y = cat_dataset['label']

        categorical_columns = X.select_dtypes(include=['object']).columns.tolist()

        encoder = OrdinalEncoder()
        X_encoded = X.copy()
        X_encoded[categorical_columns] = encoder.fit_transform(X_encoded[categorical_columns])

        X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=i)
        X_train_num, X_test_num, y_train_num, y_test_num = train_test_split(
            num_dataset.iloc[:, :-1], num_dataset['label'], test_size=0.2, random_state=i)

        # One-R
        best_feature, best_rules = one_r_train(X_train_num, y_train_num)
        majority_class = Counter(y_train).most_common(1)[0][0]
        one_r_predictions = one_r_predict(X_test_num, best_feature, best_rules, majority_class)
        one_r_micro_precision = precision_score(y_test_num, one_r_predictions, average='micro', zero_division=0)
        one_r_macro_precision = precision_score(y_test_num, one_r_predictions, average='macro', zero_division=0)
        OneR_Acc_1.append(accuracy_score(y_test, one_r_predictions))
        OneR_microP_1.append(one_r_micro_precision)
        OneR_macroP_1.append(one_r_macro_precision)

        # Gaussian Naive Bayes
        gnb = GaussianNB()
        gnb.fit(X_train_num, y_train_num)
        y_pred = gnb.predict(X_test_num)
        gnb_micro_precision = precision_score(y_test_num, y_pred, average='micro', zero_division=0)
        gnb_macro_precision = precision_score(y_test_num, y_pred, average='macro', zero_division=0)
        GNB_Acc_1.append(accuracy_score(y_test, y_pred))
        GNB_microP_1.append(gnb_micro_precision)
        GNB_macroP_1.append(gnb_macro_precision)

        # Categorical Naive Bayes
        min_categories = X_encoded.nunique()
        cnb = CategoricalNB(min_categories=min_categories)
        cnb.fit(X_train, y_train)
        y_pred = cnb.predict(X_test)
        cnb_micro_precision = precision_score(y_test, y_pred, average='micro', zero_division=0)
        cnb_macro_precision = precision_score(y_test, y_pred, average='macro', zero_division=0)
        CNB_Acc_1.append(accuracy_score(y_test, y_pred))
        CNB_microP_1.append(cnb_micro_precision)
        CNB_macroP_1.append(cnb_macro_precision)

        # 3-Nearest Neighbor model
        knn_model = KNeighborsClassifier(n_neighbors=3, weights='uniform', metric='euclidean')
        knn_model.fit(X_train_num.values, y_train_num)
        y_pred_knn = knn_model.predict(X_test_num.values)
        knn_micro_precision = precision_score(y_test, y_pred_knn, average='micro', zero_division=0)
        knn_macro_precision = precision_score(y_test, y_pred_knn, average='macro', zero_division=0)
        KNN3_Acc_1_majority.append(accuracy_score(y_test, y_pred_knn))
        KNN3_microP_1_majority.append(knn_micro_precision)
        KNN3_macroP_1_majority.append(knn_macro_precision)


    print("Accuracy of One-R:", np.mean(OneR_Acc_1).round(2))
    print("Accuracy of GNB:", np.mean(GNB_Acc_1).round(2))
    print("Accuracy of CNB:", np.mean(CNB_Acc_1).round(2)) 
    print("Accuracy of KNN(3):", np.mean(KNN3_Acc_1_majority).round(2))

    print("Micro-p of One-R:", np.mean(OneR_microP_1).round(2))
    print("Micro-p of GNB:", np.mean(GNB_microP_1).round(2))
    print("Micro-p of CNB:", np.mean(CNB_microP_1).round(2)) 
    print("Micro-p of KNN(3):", np.mean(KNN3_microP_1_majority).round(2))

    print("Macro-p of One-R:", np.mean(OneR_macroP_1).round(2))
    print("Macro-p of GNB:", np.mean(GNB_macroP_1).round(2))
    print("Macro-p of CNB:", np.mean(CNB_macroP_1).round(2)) 
    print("Macro-p of KNN(3):", np.mean(KNN3_macroP_1_majority).round(2))
    

##Adult Dataset and Student Dataset results: 

print("Adult Dataset Evaluation results:")
compare_eval(adult_num_dataset,adult_cat_dataset)
print("\n")
print("Student Dataset Evaluation results:")
compare_eval(student_num_dataset,student_cat_dataset)    
    

Adult Dataset Evaluation results:
Accuracy of One-R: 0.76
Accuracy of GNB: 0.8
Accuracy of CNB: 0.8
Accuracy of KNN(3): 0.76
Micro-p of One-R: 0.76
Micro-p of GNB: 0.8
Micro-p of CNB: 0.8
Micro-p of KNN(3): 0.76
Macro-p of One-R: 0.6
Macro-p of GNB: 0.74
Macro-p of CNB: 0.73
Macro-p of KNN(3): 0.64


Student Dataset Evaluation results:
Accuracy of One-R: 0.31
Accuracy of GNB: 0.17
Accuracy of CNB: 0.34
Accuracy of KNN(3): 0.23
Micro-p of One-R: 0.31
Micro-p of GNB: 0.17
Micro-p of CNB: 0.34
Micro-p of KNN(3): 0.23
Macro-p of One-R: 0.1
Macro-p of GNB: 0.21
Macro-p of CNB: 0.27
Macro-p of KNN(3): 0.26


**B)** Compare the average accuracy vs. macro-average and micro-average precision for the two datasets. Explain which evaluation measurement would be most appropriate for each dataset **[1.5 mark]**



Adult Dataset:

The accuracy values for the classifiers (One-R, GNB, CNB, KNN(3)) range from 0.76 to 0.8, indicating relatively high accuracy across the models. The micro-averaged precision values are consistent with the accuracy values, ranging from 0.76 to 0.8. The macro-averaged precision values are lower than the micro-averaged precision and accuracy values, ranging from 0.6 to 0.74.

Appropriate Evaluation Metric: Micro-averaged Precision

Since micro-averaged precision considers both true positives and false positives across all classes and then computes a single precision value, it aligns well with the balanced nature of the dataset and the high accuracy values. This metric provides an overall understanding of how well the models are performing in terms of precision.

Student Dataset:

The accuracy values for the classifiers range from 0.17 to 0.34, indicating relatively low accuracy across the models. The micro-averaged precision values are consistent with the accuracy values, ranging from 0.17 to 0.34. The macro-averaged precision values are also relatively low, ranging from 0.1 to 0.27.

Appropriate Evaluation Metric: Macro-averaged Precision

Macro-averaged precision calculates the precision for each class individually and then takes their average, giving equal weight to each class. Since the dataset is imbalanced(due to relatively low accuracy and micro-averaged precision) and there are variations in class distributions, using macro-averaged precision helps prevent the dominant class from overshadowing the performance of minority classes. This metric provides insights into how well the models are performing across different classes.


## Question 6. Ethics and implications in practice [4 marks]

The Categorical Naive Bayes classifier you developed in this assignment for the student dataset could for example be used to classify college applicants into admitted vs not-admitted depending on their predicted grade in the Student dataset.

**A)** Discuss ethical problems which might arise in this application and lead to unfair treatment of the applicants. Ground your discussion in the set of features provided in the student data set.**[1 marks]**



Some ethical problems that may arise in this application and lead to unfair treatment of applicants are:

##### Gender Bias: 
The 'sex' feature can lead to gender bias, with the model incorrectly associating a certain gender with having a higher or lower grade.


##### Socioeconomic Bias:
The 'Medu', 'Fedu', 'Mjob', and 'Fjob' features can lead to a socioeconomic bias where students from families with more educated parents and/or well reputated and high paying jobs could be favored, without considering the challenges faced by students who are not from these backgrounds.


##### Educational Access Bias:
The 'internet' feature can correlate to access to educational resources, and if the model incorrectly values this attibute, this could result in an unfair treatment to students lacking access to the resource.


##### Health and Personal Circumstances Bias:
Features like 'health' and 'absences', which can often be as a result of personal circumstances, and the model can inaccurately treat these features as a measure of the students commitment, and academic potential, it could lead to unfair decisions for students with health conditions and absenses due to personal circumstances.

**B)** Remove all ethically problematic features from the data set (use your own judgment), and train your Naive Bayes classifiers on the resulting data set. How does the performance change in comparison to the full classifier ( consider accuracy and micro-average precision)?**[2 marks]**



In [156]:
unethical_features = ['sex', 'Medu', 'Fedu', 'internet', 'romantic', 'health', 'famsize', 'Pstatus', 'guardian', 'famrel',
                      'absences', 'Mjob','guardian', 'famrel']

filtered_student_cat_dataset = student_cat_dataset.drop(columns=unethical_features)

print("Classifier without ethically problematic features:")
NB_models(student_num_dataset, filtered_student_cat_dataset)


Classifier without ethically problematic features:
Accuracy of GNB: 0.17
Accuracy of BNB: 0.32
Accuracy of CNB: 0.34


In [None]:
Removing the ethically problematic features does not affect the accuracies as compared to the original dataset.

**C)** The approach to fairness we have adopted is called “fairness through unawareness”, where we simply deleted any questionable features from our data. Is removing all problematic features as done in part (b) guarantee a fair classifier? Explain Why or Why not?**[1 marks]**


Removing all problematic features does not guarantee a fair classifier. While removing problematic features can mitigate some potential sources of bias and discrimination, it does not address the underlying issues that might still exist in the remaining features. 

Some reasons why removing problematic features might not guarantee fairness:

Proxy Features: Sometimes, problematic features can be proxies for other characteristics that are still present in the dataset. Even if the direct features are removed, the information they were capturing might still be present in other correlated features.

Hidden Bias: Bias can still be present in the remaining features that were not identified as problematic. The model can learn hidden biases from the data itself, even if those biases were not explicitly represented by the removed features.

In order to achieve fairness, more advanced techniques that specifically focus on identifying and mitigating bias in the data and model like re-sampling, re-weighting, adversarial training, and fairness-aware loss functions should be used. 

# Authorship Declaration:

   (1) I certify that the program contained in this submission is completely
   my own individual work, except where explicitly noted by comments that
   provide details otherwise.  I understand that work that has been developed
   by another student, or by me in collaboration with other students,
   or by non-students as a result of request, solicitation, or payment,
   may not be submitted for assessment in this subject.  I understand that
   submitting for assessment work developed by or in collaboration with
   other students or non-students constitutes Academic Misconduct, and
   may be penalized by mark deductions, or by other penalties determined
   via the University of Melbourne Academic Honesty Policy, as described
   at https://academicintegrity.unimelb.edu.au.

   (2) I also certify that I have not provided a copy of this work in either
   softcopy or hardcopy or any other form to any other student, and nor will
   I do so until after the marks are released. I understand that providing
   my work to other students, regardless of my intention or any undertakings
   made to me by that other student, is also Academic Misconduct.

   (3) I further understand that providing a copy of the assignment
   specification to any form of code authoring or assignment tutoring
   service, or drawing the attention of others to such services and code
   that may have been made available via such a service, may be regarded
   as Student General Misconduct (interfering with the teaching activities
   of the University and/or inciting others to commit Academic Misconduct).
   I understand that an allegation of Student General Misconduct may arise
   regardless of whether or not I personally make use of such solutions
   or sought benefit from such actions.

   <b>Signed by</b>: Kian Dsouza
   
   <b>Dated</b>: 01-09-2023