# Categorical Naive Bayes

In [1]:
# Load Python libraries
import pandas as pd
import numpy as np

In [2]:
# Load dataset and display the first several data samples.
df = pd.read_csv("covid-dataset.csv")
df.head()

Unnamed: 0,Breathing Problem,Fever,Dry Cough,Sore throat,Running Nose,Asthma,Chronic Lung Disease,Headache,Heart Disease,Diabetes,...,Fatigue,Gastrointestinal,Abroad travel,Contact with COVID Patient,Attended Large Gathering,Visited Public Exposed Places,Family working in Public Exposed Places,Wearing Masks,Sanitization from Market,COVID-19
0,Yes,Yes,Yes,Yes,Yes,No,No,No,No,Yes,...,Yes,Yes,No,Yes,No,Yes,Yes,No,No,Yes
1,Yes,Yes,Yes,Yes,No,Yes,Yes,Yes,No,No,...,Yes,No,No,No,Yes,Yes,No,No,No,Yes
2,Yes,Yes,Yes,Yes,Yes,Yes,Yes,Yes,No,Yes,...,Yes,Yes,Yes,No,No,No,No,No,No,Yes
3,Yes,Yes,Yes,No,No,Yes,No,No,Yes,Yes,...,No,No,Yes,No,Yes,Yes,No,No,No,Yes
4,Yes,Yes,Yes,Yes,Yes,No,Yes,Yes,Yes,Yes,...,No,Yes,No,Yes,No,Yes,No,No,No,Yes


In [3]:
# List of columns in the data
df.columns

Index(['Breathing Problem', 'Fever', 'Dry Cough', 'Sore throat',
       'Running Nose', 'Asthma', 'Chronic Lung Disease', 'Headache',
       'Heart Disease', 'Diabetes', 'Hyper Tension', 'Fatigue ',
       'Gastrointestinal ', 'Abroad travel', 'Contact with COVID Patient',
       'Attended Large Gathering', 'Visited Public Exposed Places',
       'Family working in Public Exposed Places', 'Wearing Masks',
       'Sanitization from Market', 'COVID-19'],
      dtype='object')

This dataset is used to distinguish COVID-19 from flu based on patient features. All of the features are categorical, each taking the value yes or no.
- 'Breathing Problem': yes or no
- 'Fever': yes or no
- 'Dry Cough': yes or no
- 'Sore throat': yes or no
- 'Running Nose': yes or no
- 'Asthma': yes or no 
- 'Chronic Lung Disease': yes or no
- 'Headache': yes or no
- 'Heart Disease': yes or no 
- 'Diabetes': yes or no
- 'Hyper Tension': yes or no
- 'Fatigue ': yes or no
- 'Gastrointestinal ': yes or no
- 'Abroad travel': yes or no
- 'Contact with COVID Patient': yes or no
- 'Attended Large Gathering': yes or no
- 'Visited Public Exposed Places': yes or no
- 'Family working in Public Exposed Places': yes or no
- 'Wearing Masks': yes or no
- 'Sanitization from Market': yes or no
- 'COVID-19': yes or no (Label)
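The column listing above shows that two names, `'Fatigue '` and `'Gastrointestinal '`, carry a trailing space. As a minimal sketch, the snippet below strips such whitespace and checks that every column holds only `Yes` or `No`. It runs on a small made-up frame (`toy` is a hypothetical stand-in, not the real CSV), so it illustrates the check rather than reporting anything about the actual data.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset.
toy = pd.DataFrame({
    "Fever": ["Yes", "No", "Yes"],
    "Fatigue ": ["No", "No", "Yes"],   # trailing space, as in the real column name
    "COVID-19": ["Yes", "No", "Yes"],
})

# Remove leading/trailing whitespace from the column names.
toy.columns = toy.columns.str.strip()

# Collect any values other than "Yes"/"No", per column.
allowed = {"Yes", "No"}
unexpected = {}
for col in toy.columns:
    extra = sorted(set(toy[col].unique()) - allowed)
    if extra:
        unexpected[col] = extra

print(list(toy.columns))
print(unexpected)  # an empty dict means every column is strictly yes/no
```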




In [4]:
# Make a copy of the data
data = df.copy()

In [5]:
# Get the list of categorical columns (features), i.e., those stored with the object (string) dtype.
# NumPy dtype kind codes, for reference:
# 'b'       boolean
# 'i'       (signed) integer
# 'u'       unsigned integer
# 'f'       floating-point
# 'c'       complex floating-point
# 'O'       (Python) objects
# 'S', 'a'  (byte-)string
# 'U'       Unicode
# 'V'       raw data (void)
cat_cols = [col for col in data.columns if data[col].dtype == "O"]

# Remove label from the list
cat_cols.remove("COVID-19")

# Convert data features to dummy variables, i.e., one-hot encoding
feature_data = pd.get_dummies(data, columns=cat_cols)

# Drop the 'COVID-19' column from the data frame, as it is the data label.
# axis=1 (or 'columns') drops a column; the default axis=0 (or 'index') would drop rows by index label.
feature_data = feature_data.drop(["COVID-19"], axis=1)

# Get data label from "COVID-19" column
y = data["COVID-19"].values

# Convert data label to numerical values: 1 (Yes) or 0 (No)
y = [1 if i=="Yes" else 0 for i in y]
# np.sum(y)
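The scikit-learn documentation for `CategoricalNB` states that each feature's categories are assumed to be encoded as the integers 0, ..., n - 1, and suggests `OrdinalEncoder` for this. As an alternative to one-hot encoding, the sketch below keeps one integer column per feature. The frame `toy` and the names `X_toy`, `y_toy`, and `toy_model` are hypothetical and used only for illustration.

```python
import pandas as pd
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy frame with two yes/no features and the label.
toy = pd.DataFrame({
    "Fever": ["Yes", "Yes", "No", "No", "Yes", "No"],
    "Dry Cough": ["Yes", "No", "No", "No", "Yes", "Yes"],
    "COVID-19": ["Yes", "Yes", "No", "No", "Yes", "No"],
})
feature_cols = [c for c in toy.columns if c != "COVID-19"]

# OrdinalEncoder sorts the categories, so "No" -> 0 and "Yes" -> 1.
encoder = OrdinalEncoder()
X_toy = encoder.fit_transform(toy[feature_cols]).astype(int)

# Label: 1 for "Yes", 0 for "No".
y_toy = (toy["COVID-19"] == "Yes").astype(int).to_numpy()

toy_model = CategoricalNB().fit(X_toy, y_toy)
toy_pred = toy_model.predict(X_toy)
print(X_toy)
print(toy_pred)
```

With this encoding each original feature contributes exactly one categorical likelihood term, whereas one-hot encoding of a yes/no feature yields two perfectly correlated columns.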

In [6]:
# Split the dataset into training (70%) and test (30%) sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(feature_data, y, train_size=0.7, random_state=0)

In [7]:
# Initialize categorical Naive Bayes model
from sklearn.naive_bayes import CategoricalNB
model = CategoricalNB()

# Train the model using X_train and y_train
model.fit(X_train, y_train)
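To make the fitting step more concrete, the sketch below compares the per-category probabilities learned by `CategoricalNB` with the additive (Laplace) smoothing formula `(count + alpha) / (class_count + alpha * n_categories)`, using the default `alpha=1.0`. The single-feature data is made up for illustration, and `counts` is tallied by hand from it.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Hypothetical data: one categorical feature with values 0/1, four samples.
X_toy = np.array([[1], [1], [0], [1]])
y_toy = np.array([1, 1, 0, 0])

toy_model = CategoricalNB(alpha=1.0).fit(X_toy, y_toy)

# feature_log_prob_ is a list with one (n_classes, n_categories) array per feature.
learned = np.exp(toy_model.feature_log_prob_[0])

# Hand-tallied counts: rows are classes 0 and 1, columns are categories 0 and 1.
counts = np.array([[1, 1],
                   [0, 2]])
by_hand = (counts + 1.0) / (counts.sum(axis=1, keepdims=True) + 1.0 * 2)

print(learned)
print(by_hand)
```

Smoothing keeps a category that never occurs with a class (here category 0 in class 1) from receiving probability zero.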

In [8]:
# Import functions to calculate evaluation metrics: precision, recall, F1 score.
from sklearn.metrics import precision_score, recall_score, f1_score, classification_report

# Make prediction on the test data
predicted_label = model.predict(X_test)

# Calculate evaluation metrics by comparing the true labels y_test with the predictions.
# scikit-learn metric functions take the arguments in the order (y_true, y_pred).
print(precision_score(y_test, predicted_label))
print(recall_score(y_test, predicted_label))
print(f1_score(y_test, predicted_label))
print(classification_report(y_test, predicted_label))

0.976905311778291
0.9664889565879665
0.9716692189892803
              precision    recall  f1-score   support

           0       0.87      0.91      0.89       318
           1       0.98      0.97      0.97      1313

    accuracy                           0.95      1631
   macro avg       0.92      0.94      0.93      1631
weighted avg       0.96      0.95      0.95      1631
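scikit-learn's metric functions take the true labels first and the predictions second, e.g. `precision_score(y_true, y_pred)`. The sketch below computes precision, recall, and F1 by hand from confusion-matrix counts, using made-up labels, to show that swapping the two arguments swaps precision and recall while leaving F1 unchanged. The helper `binary_metrics` is hypothetical and written only for this illustration.

```python
def binary_metrics(y_true, y_pred):
    """Return (precision, recall, f1) for the positive class 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of true positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up labels for illustration only.
y_true_demo = [1, 1, 1, 1, 0, 0]
y_pred_demo = [1, 1, 0, 0, 0, 1]

precision, recall, f1 = binary_metrics(y_true_demo, y_pred_demo)
swapped_precision, swapped_recall, swapped_f1 = binary_metrics(y_pred_demo, y_true_demo)

print(precision, recall, f1)
print(swapped_precision, swapped_recall, swapped_f1)
```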



# Mixed Naive Bayes

#### Reference
https://pypi.org/project/mixed-naive-bayes/

This module implements the categorical (multinoulli) and Gaussian naive Bayes algorithms together, hence *mixed* naive Bayes. Features are therefore not all assumed to follow a Gaussian distribution given the class label y: continuous features are modeled with a Gaussian distribution, while categorical features (nominal or ordinal) are modeled with a categorical distribution.

The motivation for writing this library is that scikit-learn does not provide a naive Bayes implementation for mixed feature types.
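The idea can be sketched from scratch in NumPy: for each class, add the log prior, the log of a smoothed categorical probability for every categorical feature, and the Gaussian log density for every continuous feature, then pick the class with the highest total. This is a minimal illustration under those assumptions, not the library's actual implementation. The functions `fit_mixed_nb` and `predict_mixed_nb` and the toy data are hypothetical.

```python
import numpy as np

def fit_mixed_nb(X, y, categorical_features, alpha=1.0, var_eps=1e-9):
    """Estimate per-class priors, categorical tables, and Gaussian parameters."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    cont = [j for j in range(X.shape[1]) if j not in categorical_features]
    cat_log_prob = {}
    for j in categorical_features:
        # Number of categories is inferred from the training data (assumption).
        n_cat = int(X[:, j].max()) + 1
        table = np.zeros((len(classes), n_cat))
        for k, c in enumerate(classes):
            counts = np.bincount(X[y == c, j].astype(int), minlength=n_cat)
            # Additive (Laplace) smoothing avoids zero probabilities.
            table[k] = (counts + alpha) / (counts.sum() + alpha * n_cat)
        cat_log_prob[j] = np.log(table)
    return {
        "classes": classes,
        "cont": cont,
        "log_prior": np.log(np.array([(y == c).mean() for c in classes])),
        "cat_log_prob": cat_log_prob,
        "mean": np.array([X[y == c][:, cont].mean(axis=0) for c in classes]),
        "var": np.array([X[y == c][:, cont].var(axis=0) for c in classes]) + var_eps,
    }

def predict_mixed_nb(params, X):
    """Return the class with the highest joint log score for each row of X."""
    X = np.asarray(X, dtype=float)
    score = np.tile(params["log_prior"], (X.shape[0], 1))       # (n, k)
    for j, table in params["cat_log_prob"].items():
        score += table[:, X[:, j].astype(int)].T                # categorical terms
    x_cont = X[:, params["cont"]][:, None, :]                   # (n, 1, d)
    mean, var = params["mean"][None], params["var"][None]       # (1, k, d)
    log_density = -0.5 * np.log(2 * np.pi * var) - (x_cont - mean) ** 2 / (2 * var)
    score += log_density.sum(axis=2)                            # Gaussian terms
    return params["classes"][score.argmax(axis=1)]

# Made-up data: column 0 is categorical, column 1 is continuous.
X_demo = [[0, 170.0], [0, 172.0], [1, 150.0], [1, 152.0]]
y_demo = [0, 0, 1, 1]
params = fit_mixed_nb(X_demo, y_demo, categorical_features=[0])
print(predict_mixed_nb(params, X_demo))
print(predict_mixed_nb(params, [[1, 151.0]]))
```

Working in log space turns the product of per-feature likelihoods into a sum, which avoids numerical underflow when there are many features.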

In [9]:
# Install the library
# !pip install git+https://github.com/remykarem/mixed-naive-bayes#egg=mixed_naive_bayes

In [1]:
# Import mixed Naive Bayes library
from mixed_naive_bayes import MixedNB

# Below is an example of a dataset with discrete (first 2 columns) and continuous data (last 2).
X = [[0, 0, 180, 75],
     [1, 1, 165, 61],
     [2, 1, 166, 60],
     [1, 1, 173, 68],
     [0, 2, 178, 71]]
y = [0, 0, 1, 1, 0]

# Specify the indices of the features which are to follow the categorical distribution (columns 0 and 1).
clf = MixedNB(categorical_features=[0,1])

# Train the model using data features X and label y
clf.fit(X,y)

# Predict labels for the training samples
clf.predict(X)

array([0, 1, 1, 1, 0])