In [2]:
import pandas as pd

A custom Naive Bayes classifier (NBC) implementation, evaluated on the `car.csv` dataset from [Kaggle](https://www.kaggle.com/elikplim/car-evaluation-data-set).

In [3]:
data = pd.read_csv("data/raw/car.csv", dtype = "category", header = None)
data.columns = ["buying", "maint", "doors", "persons", "lug-boot", "safety", "accept"]

Use the `train_test_split` function (see its documentation) to split the data into 75% training and 25% test data. Pass `random_state=0` to fix the random seed so that everyone gets the same results.

In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data.iloc[:,:-1], data['accept'], test_size = 0.25, random_state = 0)
X_train

,buying,maint,doors,persons,lug-boot,safety
520,high,vhigh,5more,2,big,med
621,high,high,5more,2,small,low
1017,med,high,3,more,small,low
1273,med,low,5more,2,med,med
924,med,vhigh,4,2,big,low
...,...,...,...,...,...,...
835,high,low,4,more,big,med
1216,med,low,3,2,small,med
1653,low,low,3,2,big,low
559,high,high,2,more,small,med


In [5]:
def accuracy(actual, predicted):
  return sum(actual == predicted) / len(predicted)

If everything is implemented correctly, the accuracy on the test set should be approximately:
1. $81.25\%$ with Laplace smoothing,
2. $82.17\%$ without smoothing.
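To make the difference between the two settings concrete, here is a minimal sketch of how an estimate of $P(x = v \mid y = c)$ changes under additive (Laplace) smoothing. The helper name `conditional_prob` and the toy counts are illustrative assumptions; they are not taken from `src.naive_bayes` or from `car.csv`.

```python
from collections import Counter


def conditional_prob(values, value, n_categories, alpha=0.0):
    """Estimate P(feature = value | class) from the values seen within one class.

    alpha = 0 gives the raw relative frequency (no smoothing);
    alpha = 1 gives Laplace (add-one) smoothing.
    """
    counts = Counter(values)
    return (counts[value] + alpha) / (len(values) + alpha * n_categories)


# Hypothetical values of one feature among the training rows of a single class.
safety_in_class = ["low", "low", "med", "low"]

# "high" never occurs in this class, so the unsmoothed estimate is exactly zero,
# which would zero out the whole product of probabilities for that class.
print(conditional_prob(safety_in_class, "high", n_categories=3))

# With alpha = 1 the estimate becomes (0 + 1) / (4 + 3) = 1/7.
print(conditional_prob(safety_in_class, "high", n_categories=3, alpha=1))
```

Smoothing trades a little bias on frequently observed values for nonzero probabilities on unseen ones, so the two settings can produce slightly different predictions on the same test set.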

In [6]:
from src.naive_bayes import NaiveBayes

# With smoothing
# model = NaiveBayes(smoothing=True)

# without smoothing
model = NaiveBayes(smoothing=False)

model.fit(X_train, y_train)
predictions = model.predict(X_test)
accuracy(y_test, predictions)  # argument order matches accuracy(actual, predicted)

0.8217592592592593

## Comparing with `scikit-learn`'s implementation

To use `sklearn`'s Naive Bayes implementation, the data must first be converted to numeric values, as in the case of KNN. `sklearn` provides several Naive Bayes classifiers for different types of data, such as `GaussianNB`, `MultinomialNB`, and `BernoulliNB`. To learn more about these types, see the [scikit-learn Naive Bayes documentation](https://scikit-learn.org/stable/modules/naive_bayes.html).

In our case the data are categorical, and our assumption was that each feature follows a categorical distribution (a generalization of the Bernoulli distribution to more than two possible outcomes). The matching `sklearn` classifier is therefore `CategoricalNB`.
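The categorical assumption can be sketched in a few lines of plain Python. The class below is a hypothetical illustration: the name `CategoricalNBSketch`, its attributes, and the toy rows are assumptions, and it is not the code in `src.naive_bayes` nor `sklearn`'s implementation. It counts how often each feature value occurs within each class and scores a row by the log prior plus the sum of log conditional probabilities.

```python
import math
from collections import Counter, defaultdict


class CategoricalNBSketch:
    """Illustrative categorical Naive Bayes (hypothetical, for explanation only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # additive smoothing strength; 0 disables smoothing

    def fit(self, rows, labels):
        rows = [list(r) for r in rows]
        labels = list(labels)
        n_features = len(rows[0])
        self.class_counts_ = Counter(labels)
        self.n_samples_ = len(labels)
        # levels_[j]: distinct values observed for feature j
        self.levels_ = [{r[j] for r in rows} for j in range(n_features)]
        # counts_[j][c][v]: number of training rows of class c with feature j == v
        self.counts_ = [defaultdict(Counter) for _ in range(n_features)]
        for r, c in zip(rows, labels):
            for j, v in enumerate(r):
                self.counts_[j][c][v] += 1
        return self

    def _log_prob(self, j, c, v):
        num = self.counts_[j][c][v] + self.alpha
        den = self.class_counts_[c] + self.alpha * len(self.levels_[j])
        # Without smoothing, an unseen (class, value) pair has probability zero.
        return math.log(num / den) if num > 0 else -math.inf

    def predict(self, rows):
        preds = []
        for r in rows:
            best, best_score = None, -math.inf
            for c in self.class_counts_:
                score = math.log(self.class_counts_[c] / self.n_samples_)
                score += sum(self._log_prob(j, c, v) for j, v in enumerate(r))
                if best is None or score > best_score:
                    best, best_score = c, score
            preds.append(best)
        return preds


# Toy data (made up for illustration): two features, two classes.
toy_rows = [["low", "big"], ["low", "small"], ["high", "big"], ["high", "small"]]
toy_labels = ["acc", "acc", "unacc", "unacc"]

sketch = CategoricalNBSketch(alpha=1).fit(toy_rows, toy_labels)
print(sketch.predict([["low", "big"], ["high", "small"]]))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied. To try the sketch on the notebook's data, the rows would need to be passed as sequences, for example `X_train.to_numpy()`, since iterating over a `DataFrame` directly yields column names rather than rows.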

In [7]:
from sklearn.preprocessing import OrdinalEncoder
enc = OrdinalEncoder()
data1 = enc.fit_transform(data)
X_train1, X_test1, y_train1, y_test1 = train_test_split(data1[:, :-1], data1[:, -1], test_size=0.25, random_state=0)

In [8]:
from sklearn.naive_bayes import CategoricalNB
model1 = CategoricalNB(alpha=1)  # alpha=1 corresponds to Laplace smoothing
model1.fit(X_train1, y_train1)

In [9]:
predictions1 = model1.predict(X_test1)
accuracy(y_test1, predictions1)  # argument order matches accuracy(actual, predicted)

0.8125

In [11]:
# A tiny alpha makes the smoothing negligible, approximating the unsmoothed model
model1 = CategoricalNB(alpha=1e-10)
model1.fit(X_train1, y_train1)
predictions1 = model1.predict(X_test1)
accuracy(y_test1, predictions1)

0.8217592592592593