# Machine Learning - Assignment 1

## Naive Bayes Learning algorithm, Cross-validation, and ROC-Curves

The aim of the assignment is to implement:

* Naive Bayes learning algorithm for binary classification tasks
* Visualization to plot a ROC-curve
* A cross-validation test
* Visualization of the average ROC-curve of a cross-validation test

Follow the instructions and implement what is missing to complete the assignment. Some functions have been started to help you a little bit with the inputs or outputs of the function.

**Note:** You might need to go back and forth during your implementation of the code. The structure is set up to make implementation easier, but how you return values from the different functions might vary, and you might find yourself going back and change something to make it easier later on.

## Assignment preparations

We help you out with importing the libraries and reading the data.

Look at the output to get an idea of how the data is structured.

In [12]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from math import e, pi, sqrt

In [23]:
class Data_set:
    def __init__(self):
        self.data = None
        self.features = None
        self.labels = None
        self.means = 0.0
        self.prior = 0.0
        self.std_devs = 0.0
        self.gaussian_probability_density = 0.0

    def fix_data_structure(self):
        if all(isinstance(row, np.ndarray) for row in self.features):
            try:
                self.features = np.array(self.features, dtype=float)
            except ValueError as e:
                print("Erreur lors de la conversion :", e)
        else:
            print("Les éléments ne sont pas tous des tableaux numpy.")

    def class_split(self):
        self.features = self.data[:, :-1]
        self.labels = self.data[:, -1]
        self.fix_data_structure()

    def display_features_and_labels(self):
        print("Features set:")
        print(self.features)
        print("Labels set:")
        print(self.labels)

In [24]:
# creating a class to make the code cleaner
class Flower:
    def __init__(self):
        self.data = None
        self.train = Data_set()
        self.test = Data_set()
        self.test_size = 0.2

    def train_test_split(self):
        np.random.shuffle(data)
        
        split_index = int(len(self.data) * (1 - self.test_size))

        self.train.data = self.data[:split_index]
        self.test.data = self.data[split_index:]

    def class_split_automation(self):
        self.train.class_split()
        self.test.class_split()
        self.train.display_features_and_labels()
        self.test.display_features_and_labels()

    def display_training(self):
        print("Train set:")
        print(self.train.data[:3])
        print("Test set:")
        print(self.test.data[:3])

In [25]:
data = pd.read_csv("iris.csv").to_numpy()

mapped, index, unique_arr = np.unique(data[:, -1], return_index=True, return_inverse=True)
data[:, -1] = unique_arr

iris_setosa = Flower()
iris_versicolor = Flower()
iris_virginica = Flower()

iris_setosa.data, iris_versicolor.data, iris_virginica.data = np.split(data, index[1:])

print(f"Full data array (features and labels):\n{iris_setosa.data[:3]}\n")
print("###############\n")
print(f"Train features (first 4 columns):\n{iris_setosa.data[:3, :-1]}\n")
print(f"Labels (last column):\n{iris_setosa.data[:3, -1:]}\n")
print(f"Names of labels:\n{[[numb, name] for numb, name in enumerate(mapped)]}")

Full data array (features and labels):
[[5.1 3.5 1.4 0.2 0]
 [4.9 3.0 1.4 0.2 0]
 [4.7 3.2 1.3 0.2 0]]

###############

Train features (first 4 columns):
[[5.1 3.5 1.4 0.2]
 [4.9 3.0 1.4 0.2]
 [4.7 3.2 1.3 0.2]]

Labels (last column):
[[0]
 [0]
 [0]]

Names of labels:
[[0, 'Iris-setosa'], [1, 'Iris-versicolor'], [2, 'Iris-virginica']]


In [26]:
# Example print of the 3 first datapoints (similar as above):
iris_setosa.data[:3]

array([[5.1, 3.5, 1.4, 0.2, 0],
       [4.9, 3.0, 1.4, 0.2, 0],
       [4.7, 3.2, 1.3, 0.2, 0]], dtype=object)

## Data handling functions

As a start, we are going to implement some basic data handling functions to use in the future.

### 1) Split class into a train and test set

First, we need to be able to split the class into a train and test set.

In [27]:
# TODO: Test the train_test_split function
iris_setosa.train_test_split()
iris_versicolor.train_test_split()
iris_virginica.train_test_split()

# TODO: Print the output
iris_setosa.display_training()
iris_versicolor.display_training()
iris_virginica.display_training()

Train set:
[[4.6 3.4 1.4 0.3 0]
 [4.9 2.4 3.3 1.0 1]
 [5.8 2.6 4.0 1.2 1]]
Test set:
[[6.0 2.7 5.1 1.6 1]
 [5.1 2.5 3.0 1.1 1]
 [5.0 3.0 1.6 0.2 0]]
Train set:
[[6.4 2.7 5.3 1.9 2]
 [5.2 4.1 1.5 0.1 0]
 [6.0 3.4 4.5 1.6 1]]
Test set:
[[6.3 3.4 5.6 2.4 2]
 [4.4 2.9 1.4 0.2 0]
 [6.7 3.1 5.6 2.4 2]]
Train set:
[[6.2 2.8 4.8 1.8 2]
 [5.5 2.4 3.7 1.0 1]
 [6.2 3.4 5.4 2.3 2]]
Test set:
[[5.1 3.8 1.9 0.4 0]
 [5.7 3.8 1.7 0.3 0]
 [6.6 2.9 4.6 1.3 1]]


### 2) Split data into features and labels

The data as shown above is not always the optimal shape. To help us keep track of things, we can split the data into its features and labels seperately.

Each class is 4 features and 1 label in the same array: 

* **[feature 1, feature 2, feature 3, feature 4, label]**

It would help us later to have the features and labels in seperate arrays in the form: 

* **[feature 1, feature 2, feature 3, feature 4]** and **[label]**

Here you are going to implement this functionallity.

We should first test the "**class_split**" function on one of the classes above (iris_setosa, etc...) to make sure it works properly.

In [28]:
# TODO: Test the class splitting function
# iris_setosa.train.class_split()
# iris_setosa.test.class_split()

# iris_versicolor.train.class_split()
# iris_versicolor.test.class_split()

# iris_virginica.train.class_split()
# iris_virginica.test.class_split()

# TODO: Print the output
# iris_setosa.train.display_features_and_labels()
# iris_setosa.test.display_features_and_labels()

# iris_versicolor.train.display_features_and_labels()
# iris_versicolor.test.display_features_and_labels()

# iris_virginica.train.display_features_and_labels()
# iris_virginica.test.display_features_and_labels()

# Or 

iris_setosa.class_split_automation()
iris_versicolor.class_split_automation()
iris_virginica.class_split_automation()

Features set:
[[4.6 3.4 1.4 0.3]
 [4.9 2.4 3.3 1. ]
 [5.8 2.6 4.  1.2]
 [7.2 3.2 6.  1.8]
 [5.5 2.6 4.4 1.2]
 [6.  2.2 5.  1.5]
 [5.7 2.8 4.1 1.3]
 [5.7 2.5 5.  2. ]
 [5.6 3.  4.1 1.3]
 [6.7 3.1 4.4 1.4]
 [5.1 3.5 1.4 0.3]
 [7.9 3.8 6.4 2. ]
 [6.3 3.3 4.7 1.6]
 [5.5 2.4 3.8 1.1]
 [7.7 3.  6.1 2.3]
 [5.4 3.  4.5 1.5]
 [7.3 2.9 6.3 1.8]
 [5.4 3.4 1.7 0.2]
 [5.5 3.5 1.3 0.2]
 [5.1 3.7 1.5 0.4]
 [7.7 3.8 6.7 2.2]
 [5.7 2.8 4.5 1.3]
 [5.9 3.  4.2 1.5]
 [6.1 2.6 5.6 1.4]
 [6.9 3.1 5.4 2.1]
 [6.8 2.8 4.8 1.4]
 [6.5 3.  5.8 2.2]
 [5.6 3.  4.5 1.5]
 [4.8 3.4 1.9 0.2]
 [4.7 3.2 1.6 0.2]
 [6.4 2.9 4.3 1.3]
 [5.2 3.5 1.5 0.2]
 [5.7 2.6 3.5 1. ]
 [6.3 3.3 6.  2.5]
 [5.9 3.2 4.8 1.8]
 [6.  3.  4.8 1.8]
 [5.4 3.9 1.7 0.4]
 [5.5 4.2 1.4 0.2]
 [6.4 3.2 5.3 2.3]
 [7.4 2.8 6.1 1.9]]
Labels set:
[0 1 1 2 1 2 1 2 1 1 0 2 1 1 2 1 2 0 0 0 2 1 1 2 2 1 2 1 0 0 1 0 1 2 1 2 0
 0 2 2]
Features set:
[[6.  2.7 5.1 1.6]
 [5.1 2.5 3.  1.1]
 [5.  3.  1.6 0.2]
 [7.1 3.  5.9 2.1]
 [6.7 3.  5.  1.7]
 [7.2 3.  5.8 1.6]
 [

We should also try to **1)** first split a class into a train and test set, **2)** split each of these two into features and abels. In total there should be 4 arrays (2 feature and 2 label arrays).

## Naive Bayes learning algorithm

When implementing the Navie Bayes learning algorithm, we can break it down into a few components.

We will implement these components one at a time.

### 3) Calculate feature statistics

First, we need to implement a function that returns feature statistics (means, standard deviation, priors) for a given set of feature data for a single class. This is the equivalent of "training" the naive bayes model.

**Note 1:** Each feature gets its own mean and standard deviation!

**Note 2:** The way you structure the functions (what is returned) shapes the remainder of the assignment.

In [29]:
def calculate_feature_statistics(data, total):
    data = [[float(value) for value in row] for row in data]

    means = [sum(feature_column) / len(feature_column) for feature_column in zip(*data)]
    
    
    std_devs = [
        (sum((x - mean) ** 2 for x in feature_column) / (len(feature_column) - 1)) ** 0.5
        for feature_column, mean in zip(zip(*data), means)
    ]
    
    prior = len(data) / len(total)
    
    return means, std_devs, prior


To make sure the function works, we should test it before proceding.

In [30]:
# TODO: Make sure to use our previous class splitting function.
# print(type(iris_setosa.train.features), iris_setosa.train.features)

# TODO: Test the function here for one of the dataset classes. 
iris_setosa.train.means, iris_setosa.train.std_devs, iris_setosa.train.prior = calculate_feature_statistics(iris_setosa.train.features, data)
iris_versicolor.train.means, iris_versicolor.train.std_devs, iris_versicolor.train.prior = calculate_feature_statistics(iris_versicolor.train.features, data)
iris_virginica.train.means, iris_virginica.train.std_devs, iris_virginica.train.prior = calculate_feature_statistics(iris_virginica.train.features, data)

# TODO: Print the output from the feature statistic function.
print("feature", iris_setosa.train.features)
print("means", iris_setosa.train.means)
print("std_devs", iris_setosa.train.std_devs) 
print("prior", iris_setosa.train.prior) 

feature [[4.6 3.4 1.4 0.3]
 [4.9 2.4 3.3 1. ]
 [5.8 2.6 4.  1.2]
 [7.2 3.2 6.  1.8]
 [5.5 2.6 4.4 1.2]
 [6.  2.2 5.  1.5]
 [5.7 2.8 4.1 1.3]
 [5.7 2.5 5.  2. ]
 [5.6 3.  4.1 1.3]
 [6.7 3.1 4.4 1.4]
 [5.1 3.5 1.4 0.3]
 [7.9 3.8 6.4 2. ]
 [6.3 3.3 4.7 1.6]
 [5.5 2.4 3.8 1.1]
 [7.7 3.  6.1 2.3]
 [5.4 3.  4.5 1.5]
 [7.3 2.9 6.3 1.8]
 [5.4 3.4 1.7 0.2]
 [5.5 3.5 1.3 0.2]
 [5.1 3.7 1.5 0.4]
 [7.7 3.8 6.7 2.2]
 [5.7 2.8 4.5 1.3]
 [5.9 3.  4.2 1.5]
 [6.1 2.6 5.6 1.4]
 [6.9 3.1 5.4 2.1]
 [6.8 2.8 4.8 1.4]
 [6.5 3.  5.8 2.2]
 [5.6 3.  4.5 1.5]
 [4.8 3.4 1.9 0.2]
 [4.7 3.2 1.6 0.2]
 [6.4 2.9 4.3 1.3]
 [5.2 3.5 1.5 0.2]
 [5.7 2.6 3.5 1. ]
 [6.3 3.3 6.  2.5]
 [5.9 3.2 4.8 1.8]
 [6.  3.  4.8 1.8]
 [5.4 3.9 1.7 0.4]
 [5.5 4.2 1.4 0.2]
 [6.4 3.2 5.3 2.3]
 [7.4 2.8 6.1 1.9]]
means [5.995000000000001, 3.09, 4.095000000000001, 1.295]
std_devs [0.855734801957973, 0.4447990038727086, 1.6906909526687297, 0.7059999273625299]
prior 0.26666666666666666


### 4) Gaussian probability density function (Gaussian PDF)

Now we need to implement the gaussian probability density function to use for a single datapoint.

**Note:** Look at the imports in the first cell at the top, it has some math numbers for easy use here. 

In [31]:
def gaussian_probability_density_function(x, mean, stdev):
    exponent = math.exp(-((x - mean) ** 2) / (2 * (stdev ** 2)))
    coefficient = 1 / (math.sqrt(2 * math.pi * (stdev ** 2)))
    return coefficient * exponent

### 5) Testing Gaussian PDF

We should test it to make sure it works. Train it, using the "calculate_feature_statistics" function, on one of the dataset classes. Then, take one datapoint from the same class and use naive bayes gaussian to make a prediction.

In [11]:
# TODO: Implement the code below to test the "gaussian_probability_density_function" function for one of the classes.



# TODO: Test with one datapoint from the learned class.

iris_setosa.train.gaussian_probability_density = gaussian_probability_density_function()
iris_versicolor.train.gaussian_probability_density = gaussian_probability_density_function()
iris_virginica.train.gaussian_probability_density = gaussian_probability_density_function()

# TODO: Print the probability density
print(...)

SyntaxError: cannot assign to Ellipsis (2628069993.py, line 3)

As a test, take one datapoint from one of the other classes and see if the predicted probability changes.

In [None]:
# TODO: Take one datapoint from one of the other classes.
other_class_datapoint = ...

# TODO: Use naive bayes gaussian on this datapoint with the same feature statistics as the first class.
... = gaussian_probability_density_function(...)

# TODO: Print the probability
print(...)

Think a bit why the probability changes, what could affect the prediction?

## Prepare Naive Bayes for binary classification

### 6) Prepare the data for inference

Before we train and test the naive bayes for multiple classes, we should get our data in order.

Similar to how we did previously, we should now split two classes into a train and test set, you may choose which two classes freely.

In [None]:
# TODO: Split two classes into train and test sets.


# TODO: Sepearte the features and lables for both the train and test set.


### 7) Class A vs Class B for binary classification

**Note:** You might need to go back and forth a bit in the following cells during your implementation of your code.

We have to get the probability from two sets of classes and compare the two probabilities in order to make a propper prediction.

Here we will implement two functions to make this possible. We seperate these functions to make the implementation of the ROC-curve easier later on.

**Function 1: naive_bayes_prediction** 
* A function that returns the probabilities for each class the model for a single datapoint.

**Function 2: probabilities_to_prediction**
* A function that takes in probabilities and returns a prediction.

In [None]:
def naive_bayes_prediction(feature_stats, data_point):

    # TODO: Implement the rest of this function. Make use of previous functions that you have implemented.


    # Note: We need to compare the probabilities at some point between the classes we test on, the predicted class should be the class with the highest probability.
    return prediction_probabilities

In [None]:
def probabilities_to_prediction(probabilities):

    # TODO: Implement prediction of the class with the highest probability

    return class_prediction

To test the function we need the feature metrics from the classes we choose. 

**Note:** Choose the correct train/test set and the correct feature/label split!

In [None]:
# TODO: Get the feature metrics for the classes.

Now we should have implemented all the neccessary parts to train a naive bayes algorithm and do inference on it. Implement a small test workflow for two of your chosen classes.

In [None]:
# TODO: Test the "naive_bayes_prediction" function and implement all neccessary code for it to work.

# TODO: Print the predicted class and the actual class for the test set

## ROC-curve

A ROC curve, or *Receiver Operating Characteristic curve*, is a graphical plot that illustrates the performance of a binary classifier such as our Naive Bayes model.

More info can be found in the course material and here: [https://en.wikipedia.org/wiki/Receiver_operating_characteristic](https://en.wikipedia.org/wiki/Receiver_operating_characteristic)

Another good illustration by Google can be found here: [https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc](https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc)

Now that we have a prediction model, we would want to try it out and test it using a ROC-curve.

### 8) True Positive Rate (TPR) and False Positive Rate (FPR)

From our prediction function we get probabilities, and for prediction purposes we have just predicted the one with the highest probability.

To plot a ROC-curve, we need the TPR and FPR for the binary classification. We will implement this here.

**Note 1:** The threshold is is a value that goes from 0 to 1. 

**Note 2:** One of the two classes will be seen as "the positive class" (prediction over the threshold) and the other as "the negative class" (prediction under the threshold).

**Note 3:** The threshold stepsize will decide the size of the returned TPR/FPR list. A value of 0.1 will give 10 elements (0 to 1 in increments of 0.1)

In [None]:
# Stepsize demonstration
print("Python list:", [x/10 for x in range(0,10,1)])

# Stepsize demonstration with numpy:
print("Numpy linspace:", np.linspace(0,1,11))
print("Numpy linspace (no endpoint):", np.linspace(0,1,10,endpoint=False))

In [None]:
def TPR_and_FPR(prediction_probabilities, test_labels, threshold_stepsize = 0.1):

    # TODO: Implement the rest of this function

    TPR = ...
    FPR = ...

    return TPR, FPR

In [None]:
# TODO: Test the "TPR_and_FPR" function on the model you have created previously.

How does the values change if you change the threshold stepsize? 

How does the values change if you change the classes you compare?

### 9) Plot the TPR and FPR

To better see what is going on, we can plot the TPR and FPR. We can also calculate the Area Under the ROC Curve (AUC or AUROC) at the same time.

In [None]:
def plot_ROC(TPR, FPR):

    # TODO: Calculate the AUC score.

    # TODO: Plot the TPR and FPR using plt (matplotlib)

    # TODO: Add a "middle-line" in the plot. This can be seen as the "better/worse than random" line.

    plt.plot(...)

    plt.show()

    pass # No need to return anything, remove this line once you are finished implementing.

In [None]:
# TODO: Test the plotting function on the TPR and FPR you just calculated.

## Cross-validation

The final task is to take everything you have implemented so far and apply it in a cross-validation loop.

**Note 1:** To better reflect a real scenarios, you should shuffle the data before doing cross-validation.

**Note 2:** When using cross-validation, the interesting thing is the mean performance (mean AUC, mean accuracy, mean ROC-curve).

**Note 3:** This part is a bit more free in terms of implementation, but make sure to use some of the previously implemented functions.

### 10) Cross-validation loop

In [None]:
def cross_validation(data, target, folds = 10, threshold_stepsize = 0.1):

    # TODO: Implement Cross-validation
    
    for i in range(folds):


        ...

    # TODO: Plot the average ROC-curve.
    # NOTE: Take the correct average!

    plot_ROC(...)

### 11) 10-fold Cross-validation on all classes

Test the "cross_validation" function on all the classes against eachother using 10 folds.

* Iris-setosa vs Iris-versicolor
* Iris-setosa vs Iris-virginica
* Iris-versicolor vs Iris-virginica

In [None]:
# TODO: Implement and test cross-validation function on all classes.

In [None]:
# TODO: Implement and test cross-validation function on all classes.

In [None]:
# TODO: Implement and test cross-validation function on all classes.

# Questions for examination:

In addition to completing the assignment with all its tasks, you should also prepare to answer the following questions:

1) Why is it called "naive bayes"?

2) What are some downsides of the naive bayes learning algorithm?

3) When using ROC-curves, what is the theoretical best and worst result you can get?

4) When using ROC-curves, in this assignment for example, is a higher threshold-stepsize always better?  

5) When using cross-validation and ROC-curves, why is it important to take the correct mean values? What could go wrong?


# Finished!

Was part of the setup incorrect? Did you spot any inconsistencies in the assignment? Could something improve?

If so, please write them and send via email and send it to:

* marcus.gullstrand@ju.se

Thank you!

In [None]:
class NaiveBayesClassifier:
    def __init__(self):
        self.class_priors = {}
        self.feature_stats = {}

    def fit(self, X, y):
        """
        Trains the Naive Bayes classifier using the training data.

        Args:
            X (ndarray): Features array.
            y (ndarray): Labels array.
        """
        classes = np.unique(y)
        for cls in classes:
            # Filter data points for the current class
            X_cls = X[y == cls]
            
            # Compute prior probability for the class
            self.class_priors[cls] = len(X_cls) / len(X)
            
            # Compute mean and variance for each feature in the class
            self.feature_stats[cls] = {
                "mean": np.mean(X_cls, axis=0),
                "variance": np.var(X_cls, axis=0) + 1e-6  # Avoid division by zero
            }

    def predict(self, X):
        """
        Predicts class labels for the given input features.

        Args:
            X (ndarray): Features array.

        Returns:
            predictions (ndarray): Predicted class labels.
        """
        predictions = []
        for x in X:
            class_probs = {}
            for cls in self.class_priors:
                # Compute the log-likelihood for each class
                mean = self.feature_stats[cls]["mean"]
                variance = self.feature_stats[cls]["variance"]
                log_likelihood = -0.5 * np.sum(((x - mean) ** 2) / variance + np.log(2 * np.pi * variance))
                class_probs[cls] = np.log(self.class_priors[cls]) + log_likelihood
            
            # Choose the class with the highest probability
            predictions.append(max(class_probs, key=class_probs.get))
        return np.array(predictions)

In [None]:
from sklearn.model_selection import KFold
from sklearn.metrics import roc_curve, auc

def cross_validation_with_roc(classifier, X, y, k=5):
    """
    Performs k-fold cross-validation and calculates ROC curves for each fold.
    
    Args:
        classifier: The Naive Bayes classifier instance.
        X (ndarray): Features array.
        y (ndarray): Labels array.
        k (int): Number of folds for cross-validation.
    
    Returns:
        mean_accuracy (float): Average accuracy across folds.
        roc_data (list): ROC data for each fold (FPR, TPR, AUC).
    """
    kf = KFold(n_splits=k, shuffle=True, random_state=42)
    accuracies = []
    roc_data = []

    for train_idx, test_idx in kf.split(X):
        # Split the data into training and testing for this fold
        X_train, X_test = X[train_idx], X[test_idx]
        y_train, y_test = y[train_idx], y[test_idx]

        # Train the classifier
        classifier.fit(X_train, y_train)
        y_scores = classifier.predict(X_test)

        # Calculate accuracy for this fold
        accuracies.append(np.mean(y_scores == y_test))

        # Calculate ROC curve and AUC for this fold
        fpr, tpr, _ = roc_curve(y_test, y_scores)
        roc_auc = auc(fpr, tpr)
        roc_data.append((fpr, tpr, roc_auc))

    mean_accuracy = np.mean(accuracies)
    return mean_accuracy, roc_data

def plot_mean_roc_curve(roc_data):
    """
    Plots the mean ROC curve from cross-validation results.
    
    Args:
        roc_data (list): ROC data for each fold (FPR, TPR, AUC).
    """
    mean_fpr = np.linspace(0, 1, 100)
    tprs = []
    aucs = []

    for fpr, tpr, roc_auc in roc_data:
        interp_tpr = np.interp(mean_fpr, fpr, tpr)
        interp_tpr[0] = 0.0
        tprs.append(interp_tpr)
        aucs.append(roc_auc)

    mean_tpr = np.mean(tprs, axis=0)
    mean_tpr[-1] = 1.0
    mean_auc = np.mean(aucs)
    std_auc = np.std(aucs)

    plt.figure()
    plt.plot(mean_fpr, mean_tpr, label=f'Mean ROC (AUC = {mean_auc:.2f} ± {std_auc:.2f})', lw=2)
    plt.fill_between(mean_fpr, 
                     np.maximum(mean_tpr - np.std(tprs, axis=0), 0), 
                     np.minimum(mean_tpr + np.std(tprs, axis=0), 1), 
                     color='grey', alpha=0.2, label='± 1 Std. Dev.')
    plt.plot([0, 1], [0, 1], linestyle='--', color='gray', label='Random guess')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Mean ROC Curve')
    plt.legend(loc='lower right')
    plt.grid()
    plt.show()

# Preparing the data for cross-validation
X_full = flower.data[:, :-1].astype(float)
y_full = np.array([1 if label == "Iris-setosa" else 0 for label in flower.data[:, -1]])

# Performing cross-validation and calculating ROC curves
mean_accuracy, roc_data = cross_validation_with_roc(nb_classifier, X_full, y_full, k=5)

# Displaying the mean accuracy
print(f"Mean Accuracy: {mean_accuracy:.2f}")

# Plotting the mean ROC curve
plot_mean_roc_curve(roc_data)

In [None]:
import math

def gaussian_pdf(x, mean, variance):
    """
    Calculates the Gaussian probability density function for a given x.
    
    Args:
        x (float or ndarray): Input value(s).
        mean (float or ndarray): Mean of the Gaussian distribution.
        variance (float or ndarray): Variance of the Gaussian distribution.
        
    Returns:
        float or ndarray: Computed probability density.
    """
    coefficient = 1.0 / math.sqrt(2 * math.pi * variance)
    exponent = math.exp(-((x - mean) ** 2) / (2 * variance))
    return coefficient * exponent

# Testing the Gaussian PDF function
sample_x = 5.0
sample_mean = 4.98918919
sample_variance = 0.1182625

# Calculating the probability density for the sample
pdf_result = gaussian_pdf(sample_x, sample_mean, sample_variance)
pdf_result

In [None]:
# Step 7: Class A vs Class B for binary classification
def filter_binary_classes(features, labels, class_a, class_b):
    """
    Filters the dataset to only include two classes (binary classification).
    
    Args:
        features (ndarray): Feature array.
        labels (ndarray): Labels array.
        class_a (str): The first class.
        class_b (str): The second class.
    
    Returns:
        filtered_features (ndarray): Features of the two classes.
        filtered_labels (ndarray): Labels of the two classes (0 and 1).
    """
    binary_indices = (labels == class_a) | (labels == class_b)
    filtered_features = features[binary_indices]
    filtered_labels = labels[binary_indices]
    binary_labels = np.where(filtered_labels == class_a, 0, 1)  # Map class_a to 0, class_b to 1
    return filtered_features, binary_labels

# Filter training and testing data for binary classification (e.g., Iris-setosa vs Iris-versicolor)
binary_train_features, binary_train_labels = filter_binary_classes(
    train_features, train_labels, "Iris-setosa", "Iris-versicolor"
)
binary_test_features, binary_test_labels = filter_binary_classes(
    test_features, test_labels, "Iris-setosa", "Iris-versicolor"
)

# Verifying the filtered data
binary_train_features[:3], binary_train_labels[:3]

In [None]:
from sklearn.metrics import confusion_matrix

def calculate_tpr_fpr(y_true, y_pred):
    """
    Calculates the True Positive Rate (TPR) and False Positive Rate (FPR).
    
    Args:
        y_true (ndarray): True labels.
        y_pred (ndarray): Predicted labels.
    
    Returns:
        tpr (float): True Positive Rate.
        fpr (float): False Positive Rate.
    """
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    tpr = tp / (tp + fn)  # True Positive Rate
    fpr = fp / (fp + tn)  # False Positive Rate
    return tpr, fpr

# Train Naive Bayes on the binary dataset
nb_classifier = NaiveBayesClassifier()
nb_classifier.fit(binary_train_features, binary_train_labels)

# Predict on the binary test set
binary_predictions = nb_classifier.predict(binary_test_features)

# Calculate TPR and FPR
tpr, fpr = calculate_tpr_fpr(binary_test_labels, binary_predictions)
tpr, fpr

In [None]:
def plot_tpr_fpr(y_true, y_scores):
    """
    Plots the True Positive Rate (TPR) and False Positive Rate (FPR) as an ROC curve.
    
    Args:
        y_true (ndarray): True binary labels.
        y_scores (ndarray): Predicted probabilities for the positive class.
    """
    fpr, tpr, _ = roc_curve(y_true, y_scores)
    roc_auc = auc(fpr, tpr)
    
    plt.figure()
    plt.plot(fpr, tpr, label=f'ROC curve (AUC = {roc_auc:.2f})')
    plt.plot([0, 1], [0, 1], linestyle='--', color='gray', label='Random guess')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curve')
    plt.legend(loc='lower right')
    plt.grid()
    plt.show()

# Calculate predicted probabilities for the positive class
binary_test_scores = nb_classifier.predict(binary_test_features)

# Plot the ROC curve
plot_tpr_fpr(binary_test_labels, binary_test_scores)

In [None]:
def cross_validation_binary(classifier, X, y, class_a, class_b, k=5):
    """
    Performs k-fold cross-validation for binary classification and calculates TPR, FPR, and AUC for each fold.
    
    Args:
        classifier: The Naive Bayes classifier instance.
        X (ndarray): Features array.
        y (ndarray): Labels array.
        class_a (str): The first class.
        class_b (str): The second class.
        k (int): Number of folds for cross-validation.
    
    Returns:
        mean_tpr (float): Mean True Positive Rate across folds.
        mean_fpr (float): Mean False Positive Rate across folds.
        mean_auc (float): Mean AUC across folds.
    """
    kf = KFold(n_splits=k, shuffle=True, random_state=42)
    tprs = []
    fprs = []
    aucs = []

    for train_idx, test_idx in kf.split(X):
        # Filter data for binary classification
        X_train, y_train = filter_binary_classes(X[train_idx], y[train_idx], class_a, class_b)
        X_test, y_test = filter_binary_classes(X[test_idx], y[test_idx], class_a, class_b)
        
        # Train the classifier
        classifier.fit(X_train, y_train)
        y_scores = classifier.predict(X_test)

        # Calculate TPR and FPR
        fpr, tpr, _ = roc_curve(y_test, y_scores)
        tprs.append(tpr)
        fprs.append(fpr)
        aucs.append(auc(fpr, tpr))

    mean_tpr = np.mean([np.mean(tpr) for tpr in tprs])
    mean_fpr = np.mean([np.mean(fpr) for fpr in fprs])
    mean_auc = np.mean(aucs)
    return mean_tpr, mean_fpr, mean_auc

# Perform cross-validation for Iris-setosa vs Iris-versicolor
mean_tpr, mean_fpr, mean_auc = cross_validation_binary(
    nb_classifier, train_features, train_labels, "Iris-setosa", "Iris-versicolor", k=5
)

mean_tpr, mean_fpr, mean_auc