# Naive Bayes

Breast Cancer Data Set: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer  
The original data set contains 201 instances of one class (No-recurrence) and 85 instances of the other class (Recurrence). In the CSV used here, 20 instances are reserved as the testing set. The task is to classify each of those instances using Naive Bayes.  

In [1]:
import pandas as pd
from tabulate import tabulate
from neoBayesian.models.naive import pyNaiveBayes

In [2]:
df = pd.read_csv('sampleFiles/cancer.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278 entries, 0 to 277
Data columns (total 9 columns):
age           278 non-null object
menopause     278 non-null object
Tumor-size    278 non-null object
inv-nodes     278 non-null object
Node-caps     278 non-null object
brest         278 non-null object
irradiat      278 non-null object
class         258 non-null object
n             258 non-null float64
dtypes: float64(1), object(8)
memory usage: 19.6+ KB


CSV FORMAT: the last two columns must be the target and the number of observations 'n'. The Naive Bayes function treats rows with an empty cell in the target column ('class' in this example) as the testing set, and it assumes n = 1 for each of those rows. In the training set, n can be greater than 1.
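
A minimal sketch of this layout with pandas, using made-up rows rather than the real file. It assumes that each training row counts n times when class priors are estimated; that is an illustration of the convention, not necessarily how pyNaiveBayes works internally.

```python
import io
import pandas as pd

# Toy CSV (made-up rows): target 'class' and count 'n' are the last two columns.
csv_text = """age,menopause,class,n
30-39,premeno,No-recurrence,1
50-59,ge40,Recurrence,2
40-49,premeno,,
"""
toy = pd.read_csv(io.StringIO(csv_text))

# Rows with an empty target cell form the testing set.
train = toy[toy["class"].notna()]
test = toy[toy["class"].isna()]

# Assumed weighting: each training row counts n times.
priors = train.groupby("class")["n"].sum() / train["n"].sum()

print(len(train), len(test))
print(priors.to_dict())
```

Here the single row without a target lands in the testing set, and the 'Recurrence' row with n = 2 carries twice the weight of the 'No-recurrence' row.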

In [3]:
df.head()

Unnamed: 0,age,menopause,Tumor-size,inv-nodes,Node-caps,brest,irradiat,class,n
0,30-39,premeno,30-34,0-2,no,left,no,No-recurrence,1.0
1,50-59,ge40,0-4,0-2,no,left,no,No-recurrence,1.0
2,50-59,ge40,0-4,0-2,no,left,no,No-recurrence,1.0
3,70-79,ge40,0-4,0-2,no,left,no,No-recurrence,1.0
4,40-49,premeno,0-4,0-2,no,left,no,No-recurrence,1.0


In [4]:
df.tail()

Unnamed: 0,age,menopause,Tumor-size,inv-nodes,Node-caps,brest,irradiat,class,n
273,40-49,premeno,25-29,0-2,no,left,no,,
274,40-49,premeno,25-29,0-2,no,left,no,,
275,50-59,premeno,25-29,0-2,no,left,no,,
276,50-59,premeno,25-29,0-2,no,left,no,,
277,50-59,ge40,30-34,0-2,no,left,no,,


In [5]:
# running the function requires only the csv path and the target column name
results = pyNaiveBayes('sampleFiles/cancer.csv', 'class')
# each row of results holds one (target, probability) tuple per class
# for an instance in the testing set
print(tabulate(results, showindex=True))

--  -------------------------  ----------------------
 0  ('No-recurrence', 0.0325)  ('Recurrence', 0.9675)
 1  ('No-recurrence', 0.0105)  ('Recurrence', 0.9895)
 2  ('No-recurrence', 0.0)     ('Recurrence', 1.0)
 3  ('No-recurrence', 0.0)     ('Recurrence', 1.0)
 4  ('No-recurrence', 0.9572)  ('Recurrence', 0.0428)
 5  ('No-recurrence', 0.9725)  ('Recurrence', 0.0275)
 6  ('No-recurrence', 0.9545)  ('Recurrence', 0.0455)
 7  ('No-recurrence', 0.9121)  ('Recurrence', 0.0879)
 8  ('No-recurrence', 0.1488)  ('Recurrence', 0.8512)
 9  ('No-recurrence', 0.0821)  ('Recurrence', 0.9179)
10  ('No-recurrence', 0.0821)  ('Recurrence', 0.9179)
11  ('No-recurrence', 0.1415)  ('Recurrence', 0.8585)
12  ('No-recurrence', 0.0315)  ('Recurrence', 0.9685)
13  ('No-recurrence', 0.0199)  ('Recurrence', 0.9801)
14  ('No-recurrence', 0.9241)  ('Recurrence', 0.0759)
15  ('No-recurrence', 0.9055)  ('Recurrence', 0.0945)
16  ('No-recurrence', 0.9055)  ('Recurrence', 0.0945)
17  ('No-recurrence', 0.938)   ('R

The function also has a 'verbose' mode that prints all conditional probabilities leading to the final results for each instance. Set 'verbose=True' and specify the number of instances to display with 'display_n' (5 by default).

In [8]:
results = pyNaiveBayes('sampleFiles/cancer.csv', 'class', verbose=True, display_n=2)


-----> OBSERVATION: 0

Pr(Target => class:No-recurrence): 0.72093023
Support: 186

Pr(Observations | No-recurrence):
Var         Category      Probability
----------  ----------  -------------
age         40-49           0.295699
menopause   premeno         0.494624
Tumor-size  40-44           0.0752688
inv-nodes   3-5             0.0806452
Node-caps   yes             0.0860215
brest       right           0.462366

Intersection = 3.01e-06

Pr(Target => class:Recurrence): 0.27906977
Support: 72

Pr(Observations | Recurrence):
Var         Category      Probability
----------  ----------  -------------
age         40-49           0.319444
menopause   premeno         0.569444
Tumor-size  40-44           0.0833333
inv-nodes   3-5             0.236111
Node-caps   yes             0.430556
brest       right           0.5

Intersection = 8.959e-05
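
The figures above are enough to check the first observation by hand. The sketch below assumes that 'Support' is the weighted class count and that each 'Intersection' is the unnormalized score of a class (its prior combined with the conditional probabilities). Under those assumptions, dividing by the totals should reproduce the printed priors and the final probabilities.

```python
# Values copied from the verbose output for observation 0.
support = {"No-recurrence": 186, "Recurrence": 72}
intersection = {"No-recurrence": 3.01e-06, "Recurrence": 8.959e-05}

# Prior of each class = its support divided by the total support.
total_support = sum(support.values())
priors = {c: round(s / total_support, 8) for c, s in support.items()}

# Posterior = class score divided by the sum of scores over all classes.
total_score = sum(intersection.values())
posterior = {c: round(v / total_score, 4) for c, v in intersection.items()}

print(priors)
print(posterior)
```

With these inputs the priors round to 0.72093023 and 0.27906977, and the posteriors to 0.0325 and 0.9675, which matches row 0 of the results.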

 RESULTS:
╒═══════════════╤═════════╕
│ Target        │   Proba │
╞═══════════════╪═════════╡
│ No-recurrence │  0.0325 │
├───────────────┼────────