# Naive Bayes

#### Ruixuan Dong

### Table of Contents

 - [The Naive Bayes classification algorithm](#1)
    - [Algorithm Explanation](#11)
    - [An Intuitive Explanation](#12)
 - [Research Problem -- Using synthetic data to classify different systolic levels](#2)
    - [Overview of the Problem set](#21)
    - [Implement Naive Bayes on Synthetic data](#22)


<a name='1'></a>
### 1- The Naive Bayes classification algorithm

In this section, we'll explain the Naive Bayes classification algorithm step by step.



<a name='11'></a>
#### 1-1 Algorithm Explanation

The code below defines a simple Naive Bayes classifier for short text messages. Here's an overview of the key functions and their purpose:

 - `loadDataSet()`: This function defines a simple dataset of text messages and assigns binary labels (0 or 1) to each message. It returns two lists: `postingList` containing the text messages and `classVec` containing the corresponding labels.
 - `createVocabList(dataSet)`: This function takes the `postingList` and returns a vocabulary list containing the unique words from all the text messages.
 - `setOfWords2Vec(vocabList, inputSet)`: This function takes a vocabulary list and a text message as input and converts the message into a binary vector. Each element of the vector corresponds to a word in the vocabulary, and a 1 indicates the presence of that word in the message.
 - `trainNB0(trainMatrix, trainCategory)`: This function performs the training of the Naive Bayes classifier. It calculates the probabilities of words given the class (either 0 or 1) and the probability of class 1 (`pAbusive`).
 - `classifyNB(vec2Classify, p0Vec, p1Vec, pClass1)`: This function classifies a new input vector (`vec2Classify`) based on the probabilities calculated during training. It computes the probability of the input vector belonging to class 1 and class 0 and returns the class with the higher probability.
 - `testingNB()`: This function trains the classifier on the toy dataset and demonstrates the classification by testing two example messages.

In [4]:
def loadDataSet():
    postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                 ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                 ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                 ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                 ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                 ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0,1,0,1,0,1] 
    return postingList,classVec

In [5]:
def createVocabList(dataSet):
    vocabSet = set([])
    for document in dataSet:
        vocabSet = vocabSet | set(document) 
    return list(vocabSet)

In [6]:
def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else: print("the word: %s is not in my Vocabulary!" % word)
    return returnVec

In [None]:
listOPosts,listClasses = loadDataSet()
myVocabList = createVocabList(listOPosts)
myVocabList

In [20]:
def trainNB0(trainMatrix,trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory)/float(numTrainDocs)
    p0Num = np.zeros(numWords); p1Num = np.zeros(numWords)
    p0Denom = 0.0; p1Denom = 0.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = p1Num/p1Denom          #change to log()
    p0Vect = p0Num/p0Denom          #change to log()
    return p0Vect,p1Vect,pAbusive

In [39]:
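def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    # Multiply the conditional probabilities of the words present in the message only.
    # Multiplying the whole 0/1 vector into the product would zero it out as soon as
    # any vocabulary word is absent from the message.
    present = np.asarray(vec2Classify) == 1
    p1 = np.prod(p1Vec[present]) * pClass1
    p0 = np.prod(p0Vec[present]) * (1.0 - pClass1)
    print('p0:',p0)
    print('p1:',p1)
    if p1 > p0:
        return 1
    else: 
        return 0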
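
The `#change to log()` comments in `trainNB0` point at a common refinement: working with log-probabilities so that products of many small numbers do not underflow. The sketch below is one possible version, not the notebook's own code. It assumes Laplace smoothing (word counts start at 1, denominators at 2) so that no probability is exactly zero, and the names `train_nb_log` and `classify_nb_log`, as well as the four-word toy vocabulary, are made up for illustration.

```python
import numpy as np

def train_nb_log(train_matrix, train_category):
    """Hypothetical log-space variant of trainNB0 with Laplace smoothing."""
    train_matrix = np.asarray(train_matrix, dtype=float)
    train_category = np.asarray(train_category)
    num_words = train_matrix.shape[1]
    p_class1 = train_category.mean()
    rows0 = train_matrix[train_category == 0]
    rows1 = train_matrix[train_category == 1]
    # Assumption: counts start at 1 and denominators at 2 (Laplace smoothing),
    # so no conditional probability is exactly zero.
    p0_num = np.ones(num_words) + rows0.sum(axis=0)
    p1_num = np.ones(num_words) + rows1.sum(axis=0)
    p0_denom = 2.0 + rows0.sum()
    p1_denom = 2.0 + rows1.sum()
    return np.log(p0_num / p0_denom), np.log(p1_num / p1_denom), p_class1

def classify_nb_log(vec, p0_log, p1_log, p_class1):
    """Compare the two log scores and return the winning class (0 or 1)."""
    vec = np.asarray(vec, dtype=float)
    # vec is 0/1, so the dot product sums the log-probabilities of present words.
    log_p1 = vec @ p1_log + np.log(p_class1)
    log_p0 = vec @ p0_log + np.log(1.0 - p_class1)
    return int(log_p1 > log_p0)

# Made-up example: vocabulary = ['cute', 'love', 'stupid', 'worthless']
toy_matrix = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 0, 0], [0, 0, 1, 0]]
toy_labels = [0, 1, 0, 1]
p0_log, p1_log, p_c1 = train_nb_log(toy_matrix, toy_labels)
print(classify_nb_log([0, 1, 0, 0], p0_log, p1_log, p_c1))  # message: 'love'
print(classify_nb_log([0, 0, 1, 0], p0_log, p1_log, p_c1))  # message: 'stupid'
```

Summing logs gives the same ranking of classes as multiplying probabilities, because the logarithm is monotonic.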

In [22]:
def testingNB():
    listOPosts,listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    trainMat=[]
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    p0V,p1V,pAb = trainNB0(np.array(trainMat),np.array(listClasses))
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))
    if classifyNB(thisDoc,p0V,p1V,pAb):
        print(testEntry,'yes')
    else:
        print(testEntry,'no')
    testEntry = ['stupid', 'garbage']
    thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))
    if classifyNB(thisDoc,p0V,p1V,pAb):
        print(testEntry,'yes')
    else:
        print(testEntry,'no')


In [24]:
import numpy as np

listOPosts,listClasses = loadDataSet()
myVocabList = createVocabList(listOPosts)
myVocabList
trainMat = []

for postinDoc in listOPosts:
    trainMat.append(setOfWords2Vec(myVocabList, postinDoc)) 
    
p0V,p1V,pAb=trainNB0(trainMat,listClasses)

pAb

0.5

In [25]:
p0V

array([0.04166667, 0.04166667, 0.        , 0.04166667, 0.        ,
       0.04166667, 0.04166667, 0.04166667, 0.        , 0.        ,
       0.        , 0.08333333, 0.04166667, 0.04166667, 0.04166667,
       0.        , 0.04166667, 0.04166667, 0.        , 0.        ,
       0.04166667, 0.04166667, 0.04166667, 0.04166667, 0.04166667,
       0.04166667, 0.04166667, 0.        , 0.        , 0.125     ,
       0.04166667, 0.        ])

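The cells above define the classifier and train it on the toy dataset: `pAb` is the prior probability of class 1, and `p0V` holds the per-word probabilities for class 0. Calling `testingNB()` then classifies two test messages and prints the results.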

<a name='12'></a>
#### 1-2 An Intuitive Explanation


Naive Bayes is a probabilistic algorithm used for classification. It works by calculating the probability that a given input belongs to each class and then choosing the class with the highest probability.

In this specific implementation, we use text data, and the algorithm assumes that the words in the text are conditionally independent given the class label (hence "naive"). It calculates the probability of each word occurring in messages belonging to class 0 and class 1 during training.
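
In symbols, the decision rule can be sketched as follows, where $w_1, \dots, w_n$ are the words of a message and $c$ is the class label (this notation is introduced here for illustration and does not appear in the code):

$$
P(c \mid w_1, \dots, w_n) \;\propto\; P(c) \prod_{i=1}^{n} P(w_i \mid c),
\qquad
\hat{c} = \arg\max_{c \in \{0, 1\}} \; P(c) \prod_{i=1}^{n} P(w_i \mid c)
$$

The product form on the right is exactly what the conditional independence assumption buys: each word contributes its own factor $P(w_i \mid c)$.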

For example, for the message `['love', 'my', 'dalmation']` (spelled as in the dataset), the algorithm looks up the probability of each word appearing in messages labeled 0 and in messages labeled 1. For each class, it multiplies these word probabilities by the class prior, and it assigns the message to the class with the larger product.
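
The arithmetic for this message can be checked by hand. The sketch below uses word counts read off the toy dataset in section 1-1 (24 word occurrences in class 0, 19 in class 1, three messages per class) and exact fractions; the variable names are illustrative only.

```python
from fractions import Fraction

# Each class holds 3 of the 6 training messages.
prior = {0: Fraction(3, 6), 1: Fraction(3, 6)}
# Word frequencies per class, counted from the toy dataset in section 1-1.
word_given_class = {
    0: {'love': Fraction(1, 24), 'my': Fraction(3, 24), 'dalmation': Fraction(1, 24)},
    1: {'love': Fraction(0, 19), 'my': Fraction(0, 19), 'dalmation': Fraction(0, 19)},
}

message = ['love', 'my', 'dalmation']
score = {}
for label in (0, 1):
    s = prior[label]
    for word in message:
        s *= word_given_class[label][word]  # one factor per word in the message
    score[label] = s

predicted = max(score, key=score.get)
print(score)                           # class 0: 1/9216, class 1: 0
print('predicted class:', predicted)   # predicted class: 0
```

None of the three words occurs in a class 1 message, so the class 1 score is zero and the message is assigned to class 0.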

<a name="2"></a>
### 2 - Research Problem - Using Synthetic Data to Classify Different Systolic Levels

<a name="21"></a>
#### 2-1 Overview of the Problem Set

This section applies the Naive Bayes classifier from section 1 to a synthetic dataset, with the goal of classifying each sample by its systolic blood pressure level.

**Problem Statement:** The generated dataset ("total_large.csv") contains:
- 6,000 samples, each labeled by systolic blood pressure (SBP) as lower (100 <= SBP < 140) or higher (140 <= SBP <= 160)
- each sample has shape (1, 1003): the 1000-dimensional signal plus heart rate, respiratory rate, and diastolic blood pressure

In this part, we will build a simple Naive Bayes classifier that classifies each sample as lower or higher SBP.

Let's get more familiar with the dataset. Load the data by running the following code.

In [29]:
import pandas as pd
column_names = [str(i) for i in range(1, 1001)] + ['heart_rate', 'respiratory_rate', 'systolic', 'diastolic']
total = pd.read_csv('total_large.csv', 
                     header=None, 
                     names=column_names)
total.head(3)

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,995,996,997,998,999,1000,heart_rate,respiratory_rate,systolic,diastolic
0,5.439234e-08,2.583753e-07,1e-06,4e-06,8e-06,6e-06,-4.897503e-07,-4e-06,-2.029141e-07,3.029687e-06,...,-3.750468e-08,-3.504179e-08,-3.266654e-08,-2.969555e-08,-2.688206e-08,-2.599564e-08,109.0,19.0,160.0,66.0
1,5.781177e-08,3.850786e-07,2e-06,7e-06,7e-06,-1e-06,-3.642447e-06,2e-06,8.308896e-07,-1.850758e-06,...,-3.937486e-08,-3.615418e-08,-3.250324e-08,-2.930146e-08,-2.813366e-08,-2.915194e-08,131.0,15.0,153.0,64.0
2,3.434446e-08,2.098668e-07,3e-06,6e-06,-3e-06,2e-06,-1.939304e-06,1e-06,-9.990558e-07,3.452373e-07,...,-3.199401e-08,-2.472291e-08,-1.890941e-08,-1.882332e-08,-2.18826e-08,-2.335538e-08,128.0,14.0,120.0,85.0


In [31]:
def signal2matrix(total):
    total = total.values

    numberOfLines = len(total)
    returnMat = np.zeros((numberOfLines, 1003))
    classLabelVector = []
    index = 0

    for line in total:
        returnMat[index, :1002] = line[:1002]
        returnMat[index, 1002] = line[1003]
        if 100 <=line[1002]< 140:
            classLabelVector.append(1)
        elif 140 <=line[1002]<= 160:
            classLabelVector.append(2)
        index += 1
    return returnMat, classLabelVector

In [32]:
signalDataMat,signalLabels = signal2matrix(total)

In [33]:
def autoNorm(dataSet):
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    normDataSet = np.zeros(np.shape(dataSet))
    m = dataSet.shape[0]
    normDataSet = dataSet - np.tile(minVals, (m, 1))
    normDataSet = normDataSet / np.tile(ranges, (m, 1))
    return normDataSet, ranges, minVals

In [34]:
normMat, ranges, minVals = autoNorm(signalDataMat)
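
`autoNorm` performs min-max scaling: each column is shifted by its minimum and divided by its range, so values end up in [0, 1]. As a rough sketch, the same idea can be written with NumPy broadcasting instead of `np.tile`. The name `auto_norm_broadcast`, the toy array, and the handling of constant columns (mapped to 0 rather than dividing by zero) are assumptions made for this example.

```python
import numpy as np

def auto_norm_broadcast(data_set):
    """Hypothetical min-max scaler: maps each column to the range [0, 1]."""
    data_set = np.asarray(data_set, dtype=float)
    min_vals = data_set.min(axis=0)
    ranges = data_set.max(axis=0) - min_vals
    # Assumption: a constant column (range 0) is mapped to 0 instead of NaN.
    safe_ranges = np.where(ranges == 0, 1.0, ranges)
    # Broadcasting stretches the 1-D min/range vectors across all rows.
    norm = (data_set - min_vals) / safe_ranges
    return norm, ranges, min_vals

# Made-up values, three samples with three features each.
toy = np.array([[60.0, 12.0, 5.0],
                [80.0, 16.0, 5.0],
                [100.0, 20.0, 5.0]])
norm, ranges, min_vals = auto_norm_broadcast(toy)
print(norm)
```

The returned `ranges` and `min_vals` can be reused to scale new samples in the same way as the training data.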

<a name='22'></a>
#### 2-2 Implement Naive Bayes on Synthetic Data

In [None]:
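def signal2matrix(total):
    total = total.values

    numberOfLines = len(total)
    featureMatrix = np.zeros((numberOfLines, 1002))
    classLabelVector = []

    for index, line in enumerate(total):
        # Columns 0-1001: the 1000-d signal, heart rate and respiratory rate
        featureMatrix[index, :] = line[:1002]
        # Column 1002: systolic blood pressure, used only to derive the label
        classLabel = line[1002]
        # Labels are encoded as 0 (lower) and 1 (higher) because trainNB0 and
        # classifyNB from section 1-1 assume classes 0 and 1
        if 100 <= classLabel < 140:
            classLabelVector.append(0)
        elif 140 <= classLabel <= 160:
            classLabelVector.append(1)

    return featureMatrix, classLabelVector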

from sklearn.model_selection import train_test_split

# 'total' is the DataFrame loaded in section 2-1
X, y = signal2matrix(total)

# Split the dataset into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# trainNB0 and classifyNB are the functions defined in section 1-1

# Train the Naive Bayes classifier on the training data
p0V, p1V, pAb = trainNB0(X_train, y_train)

# Optionally, you can print the class probabilities
print('p0V:', p0V)
print('p1V:', p1V)
print('pAb:', pAb)

from sklearn.metrics import accuracy_score

# Classify each test sample with the trained parameters
y_pred = []

for x in X_test:
    thisDoc = np.array(x)
    class_label = classifyNB(thisDoc, p0V, p1V, pAb)
    y_pred.append(class_label)

accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)
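
The features in this dataset are continuous, whereas `trainNB0` and `classifyNB` were written for binary word vectors. One common variant for continuous inputs is Gaussian Naive Bayes, which models each feature within each class as a normal distribution. The sketch below is an illustration of that variant, not the method used above. The helper names, the `var_floor` constant, and the randomly generated stand-in data are all assumptions, and no results for the SBP dataset are implied.

```python
import numpy as np

def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Estimate per-class priors, feature means and variances (hypothetical helper)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    classes = np.unique(y)
    priors = np.array([(y == c).mean() for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # var_floor is an assumed small constant that avoids division by zero.
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + var_floor
    return classes, priors, means, variances

def predict_gaussian_nb(X, classes, priors, means, variances):
    """Return the class with the largest log posterior for each row of X."""
    X = np.asarray(X, dtype=float)
    # Shape (n_samples, n_classes, n_features): Gaussian log-density per feature.
    diff = X[:, None, :] - means[None, :, :]
    log_like = -0.5 * (np.log(2 * np.pi * variances)[None] + diff**2 / variances[None])
    # Naive assumption: sum the per-feature log-densities, then add the log prior.
    log_post = log_like.sum(axis=2) + np.log(priors)[None, :]
    return classes[np.argmax(log_post, axis=1)]

# Made-up stand-in data (NOT the SBP dataset): two well-separated classes.
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0.0, 1.0, size=(100, 5)),
                   rng.normal(3.0, 1.0, size=(100, 5))])
y_toy = np.array([0] * 100 + [1] * 100)
model = fit_gaussian_nb(X_toy, y_toy)
accuracy_toy = (predict_gaussian_nb(X_toy, *model) == y_toy).mean()
print('training accuracy on toy data:', accuracy_toy)
```

If something like this were tried on `X_train` and `X_test`, `var_floor` would need to be chosen relative to the feature scale, since the raw signal values here are very small; scaling the features first, as `autoNorm` does, is one way to sidestep that.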
