## Supplement 4: Classification

In [2]:
%matplotlib inline
import numpy as np
import pandas as pd
import sklearn


### 4.1 Programming Task: Gaussian Naive-Bayes Classifier
The Iris dataset, containing measurements of the flower parts obtained from 3 different species of the Iris plant, is provided in the file __iris.csv__. The first four columns of the dataset contain the measurement values representing input features for the model and the last column contains class labels of the plant species: Iris-setosa, Iris-versicolor, and Iris-virginica.
The goal of this task is to implement a Gaussian Naive-Bayes classifier for the Iris dataset.

i\. What are the assumptions on the dataset required for the Gaussian Naive-Bayes model?

ii\. Split the dataset into training and test sets in an 80:20 ratio.
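For question i, the two modelling assumptions can be written out explicitly. The notation here is introduced for illustration and is not part of the original task: $x_j$ is the $j$-th feature, $d$ the number of features, and $\mu_{jk}$, $\sigma_{jk}^2$ the mean and variance of feature $j$ within class $k$.

$$
p(x_1,\dots,x_d \mid y=k) = \prod_{j=1}^{d} p(x_j \mid y=k),
\qquad
p(x_j \mid y=k) = \frac{1}{\sqrt{2\pi\sigma_{jk}^2}}
\exp\!\left(-\frac{(x_j-\mu_{jk})^2}{2\sigma_{jk}^2}\right)
$$

The first equation is the conditional-independence ("naive") assumption and the second is the Gaussian assumption on each feature. The predicted class is then $\hat{y} = \arg\max_k \; p(y=k)\prod_{j=1}^{d} p(x_j \mid y=k)$.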


In [13]:
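# i. Assumptions: given the class, the features are conditionally independent,
#    i.e. p(x1,x2,...,xn|y=k) = p(x1|y=k)*...*p(xn|y=k),
#    and each p(xi|y=k) is a univariate Gaussian with its own mean and variance.
data=pd.read_csv('./iris.csv')

# ii. Sequential 80:20 split: the first 80% of the rows form the train set.
#     No shuffling is done here, so if iris.csv is sorted by species the
#     test set will not contain every class in equal proportion.
train_length=int(0.8*len(data))
train_data=data[:train_length]
test_data=data[train_length:]
print(len(data),len(train_data),len(test_data))

def to_numpy(X):
    return X.to_numpy()

# Separate the feature columns from the label column ("Species").
x_train,y_train=train_data.drop(columns=["Species"]),train_data["Species"]
x_test,y_test=test_data.drop(columns=["Species"]),test_data["Species"]
x_train_np,x_test_np=to_numpy(x_train),to_numpy(x_test)

# Encode class names as integers 1..K, in order of first appearance in y_train.
unique_classes=y_train.unique()
str_to_int={cls:idx+1 for idx,cls in enumerate(unique_classes)}
y_train_int,y_test_int=y_train.map(str_to_int),y_test.map(str_to_int)
y_train_np,y_test_np=to_numpy(y_train_int),to_numpy(y_test_int)
print(np.unique(y_train_np))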

150 120 30
[1 2 3]
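The split above takes the first 80% of the rows in file order. If the rows happen to be grouped by species, a shuffled, per-class (stratified) split keeps the class proportions the same in both parts. The following is a minimal sketch of that idea; `stratified_split`, `train_frac`, `seed`, and the toy label array are hypothetical names and data introduced here, and the toy labels assume 50 rows per class rather than being read from `iris.csv`.

```python
import numpy as np

def stratified_split(y, train_frac=0.8, seed=0):
    """Return (train_idx, test_idx) so that each class is split by train_frac."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for cls in np.unique(y):
        # Shuffle the row indices that belong to this class.
        idx = rng.permutation(np.flatnonzero(y == cls))
        n_train = int(round(train_frac * len(idx)))
        train_idx.extend(idx[:n_train])
        test_idx.extend(idx[n_train:])
    return np.sort(np.array(train_idx)), np.sort(np.array(test_idx))

# Toy labels standing in for the Species column (assumption: 3 classes,
# 50 rows each, sorted by class).
y_demo = np.repeat([1, 2, 3], 50)
train_idx, test_idx = stratified_split(y_demo)
print(len(train_idx), len(test_idx))
print(np.bincount(y_demo[test_idx])[1:])  # test rows per class
```

The returned index arrays could then be used to select rows, for example with `data.iloc[train_idx]` and `data.iloc[test_idx]`. Re-running the notebook with such a split would change the accuracies printed in the later cells.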


iii\. Estimate the parameters of the Gaussian Naive-Bayes classifier using the train set.


In [20]:

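def mean_var_prior(X,Y):
    # Estimate, for every class, the feature mean vector, the covariance
    # matrix and the class prior (fraction of training samples in the class).
    classes=np.unique(Y)
    mean,Var,Prior={},{},{}
    for cls in classes:
        X_cls=X[Y==cls]
        mean[cls]=np.mean(X_cls,axis=0)
        # Note: np.cov returns the full covariance matrix, including the
        # off-diagonal terms. A strictly naive model would keep only the
        # diagonal (per-feature variances).
        Var[cls]=np.cov(X_cls,rowvar=False)
        Prior[cls]=len(X_cls)/len(Y)
    return mean,Var,Prior


def multivariate_gaussian(X,mean,cov):
    # Density of a d-dimensional Gaussian N(mean,cov) at a single sample X.
    d=X.shape[-1]
    det=np.linalg.det(cov)
    normalizer=1/(np.power((2*np.pi),float(d/2))*np.power(det,1./2))
    x_mu=X-mean
    inv=np.linalg.inv(cov)
    p_X=normalizer*np.exp(-0.5*(x_mu.T@inv@x_mu))
    return p_X

def predict(X,mean,Var,Prior):
    # For each sample pick the class maximizing p(x|y=k)*p(y=k).
    classes=list(mean.keys())
    predictions=[]
    for x in X:
        class_probs={}
        for cls in classes:
            p_x_given_cls=multivariate_gaussian(x,mean[cls],Var[cls])
            class_probs[cls]=p_x_given_cls*Prior[cls]
        predicted_class=max(class_probs,key=class_probs.get)
        predictions.append(predicted_class)
    return np.array(predictions)


# Fit on the train set and report the accuracy on the train set itself.
Mean,Var,Prior=mean_var_prior(x_train_np,y_train_np)
predictions=predict(x_train_np,Mean,Var,Prior)
accuracy=np.sum(y_train_np==predictions)/len(y_train_np)
print(accuracy)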


0.9916666666666667
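The cell above estimates a full covariance matrix per class with `np.cov`. Under the naive independence assumption only the per-feature variances are needed, and the class score becomes a sum of one-dimensional log densities. The sketch below shows one way this could look; `fit_diag_gnb` and `predict_diag_gnb` are hypothetical helper names, the data is synthetic rather than the Iris measurements, and working in log space is a design choice made here to avoid underflow from multiplying small densities.

```python
import numpy as np

def fit_diag_gnb(X, y):
    """Per-class feature means, feature variances, and class priors."""
    params = {}
    for cls in np.unique(y):
        X_cls = X[y == cls]
        params[cls] = (
            X_cls.mean(axis=0),          # mean of each feature
            X_cls.var(axis=0, ddof=1),   # variance of each feature (diagonal only)
            len(X_cls) / len(y),         # class prior
        )
    return params

def predict_diag_gnb(X, params):
    """Pick the class with the largest log prior + summed per-feature log density."""
    classes = np.array(list(params.keys()))
    scores = []
    for cls in classes:
        mu, var, prior = params[cls]
        # Sum over features of log N(x_j; mu_j, var_j), for every row of X.
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (X - mu) ** 2 / var, axis=1)
        scores.append(np.log(prior) + log_lik)
    return classes[np.argmax(np.stack(scores, axis=1), axis=1)]

# Synthetic, well-separated data (assumption: not the Iris data), used only
# to exercise the two functions.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 1.0, size=(40, 4)),
                    rng.normal(6.0, 1.0, size=(40, 4))])
y_demo = np.repeat([1, 2], 40)
params = fit_diag_gnb(X_demo, y_demo)
pred = predict_diag_gnb(X_demo, params)
print(np.mean(pred == y_demo))
```

In the notebook these helpers could be called with `x_train_np`, `y_train_np`, and `x_test_np` in place of the synthetic arrays; the resulting accuracies are not shown here because they were not part of the original run.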


iv\. Using the learned parameters, predict the classes for the samples in the test set.


In [21]:

predictions=predict(x_test_np,Mean,Var,Prior)
accuracy=np.sum(y_test_np==predictions)/len(y_test_np)
print(accuracy)


0.9666666666666667
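A single accuracy number does not show which classes are being confused with each other. A small confusion matrix makes that visible, which is useful when the test set is not balanced across classes. This is a minimal sketch; `confusion_matrix` here is a hypothetical hand-written helper and the label arrays are made up for illustration, not the notebook's predictions.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, classes):
    """Count matrix: rows are true classes, columns are predicted classes."""
    index = {cls: i for i, cls in enumerate(classes)}
    cm = np.zeros((len(classes), len(classes)), dtype=int)
    for t, p in zip(y_true.tolist(), y_pred.tolist()):
        cm[index[t], index[p]] += 1
    return cm

# Made-up labels for illustration only.
y_true_demo = np.array([1, 1, 2, 2, 3, 3, 3])
y_pred_demo = np.array([1, 1, 2, 3, 3, 3, 2])
cm = confusion_matrix(y_true_demo, y_pred_demo, classes=[1, 2, 3])
print(cm)
print(np.trace(cm) / cm.sum())  # overall accuracy = correct / total
```

With the notebook's variables, the call would be `confusion_matrix(y_test_np, predictions, classes=[1, 2, 3])`.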


What is the accuracy of the model on the test set?

In [25]:
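# Cross-check: scikit-learn's GaussianNB on the same split, for comparison
# with the test accuracy of the hand-written model printed above.
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
model=GaussianNB()
model.fit(x_train_np,y_train_np)
pred=model.predict(x_test_np)
accuracy=accuracy_score(y_test_np,pred)
print(accuracy)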



0.9333333333333333
