# PA5 - Naive Bayes

## Step 1 - Train Dataset

For this step, we will create a Naive Bayes classifier for the "train" dataset, provided below. 

In [8]:
table = [
    ["weekday", "spring", "none", "none", "on time"],
    ["weekday", "winter", "none", "slight", "on time"],
    ["weekday", "winter", "none", "slight", "on time"],
    ["weekday", "winter", "high", "heavy", "late"], 
    ["saturday", "summer", "normal", "none", "on time"],
    ["weekday", "autumn", "normal", "none", "very late"],
    ["holiday", "summer", "high", "slight", "on time"],
    ["sunday", "summer", "normal", "none", "on time"],
    ["weekday", "winter", "high", "heavy", "very late"],
    ["weekday", "summer", "none", "slight", "on time"],
    ["saturday", "spring", "high", "heavy", "cancelled"],
    ["weekday", "summer", "high", "slight", "on time"],
    ["saturday", "winter", "normal", "none", "late"],
    ["weekday", "summer", "high", "none", "on time"],
    ["weekday", "winter", "normal", "heavy", "very late"],
    ["saturday", "autumn", "high", "slight", "on time"],
    ["weekday", "autumn", "none", "heavy", "on time"],
    ["holiday", "spring", "normal", "slight", "on time"],
    ["weekday", "spring", "normal", "none", "on time"],
    ["weekday", "spring", "normal", "slight", "on time"]
]

To create this classifier, we first must calculate prior probabilities for each class label.
We will do this with the following helper function:

* `calculatePriors()`
    * **Params**:
        * `data` - The dataset to calculate prior probabilities for
        * `index` - The index of the class label in each instance of the dataset
        * `classes` - An array of the classes present in the dataset
    * **Returns**:
        * An array of prior probability values, ordered according to the `classes` param.

In [24]:
def calculatePriors(data, index, classes):
    counts = {}
    total = 0
    for label in classes:
        counts[label] = 0
    for instance in data:
        counts[instance[index]] += 1
        total += 1
    probabilities = []
    for label in classes:
        probabilities.append(counts[label] / total)
    return probabilities

To check these values, we will refer to **Figure 3.2** from the Bramer textbook, which gives the prior probabilities for "on time", "late", "very late", and "cancelled" as 0.70, 0.10, 0.15, and 0.05, respectively.

We will run our `calculatePriors()` function on these labels and display the values, which should match the given probabilities.

In [25]:
priors = calculatePriors(table, 4, ["on time", "late", "very late", "cancelled"])
print(priors)

[0.7, 0.1, 0.15, 0.05]
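
Rather than comparing the printed floats by eye, the comparison with Bramer's figures could also be done with a tolerance. The following is a minimal sketch, not part of the original assignment: `prior_estimates` is a hypothetical helper built on `collections.Counter`, and `class_column` reproduces the class column of the table above.

```python
import math
from collections import Counter

# Class column of the train table above: 14 "on time", 2 "late",
# 3 "very late", 1 "cancelled" (order does not matter for priors).
class_column = ["on time"] * 14 + ["late"] * 2 + ["very late"] * 3 + ["cancelled"]

def prior_estimates(labels, classes):
    """Hypothetical helper: relative frequency of each class label."""
    counts = Counter(labels)
    return [counts[c] / len(labels) for c in classes]

classes = ["on time", "late", "very late", "cancelled"]
expected = [0.70, 0.10, 0.15, 0.05]  # values quoted from Bramer, Figure 3.2
estimates = prior_estimates(class_column, classes)

# math.isclose avoids relying on exact floating-point equality
matches = all(math.isclose(e, x, abs_tol=1e-9) for e, x in zip(estimates, expected))
print(matches)
```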


As we can see, the values returned from `calculatePriors()` match those given in Bramer. Next, we will calculate the conditional probability of a given attribute value for each class label, that is, P(attribute = value | class). Despite its name, the helper function `calculatePosteriors()` computes these conditional probabilities rather than true posteriors:

* `calculatePosteriors()`
    * **Params**:
        * `data` - The dataset to calculate probabilities for
        * `attributeIndex` - The index of the attribute to calculate conditional probability for
        * `attribute` - The value of the attribute to calculate conditional probability for
        * `classIndex` - The index of the class label
        * `classLabels` - All classifier labels to calculate conditional probabilities for
    * **Returns**:
        * A list of conditional probabilities of the given attribute value for each class label, ordered with respect to `classLabels`
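
In symbols, the quantity estimated here for an attribute $a$ with value $v$ and a class label $c$ is a ratio of counts over the training data. The notation $N(\cdot)$ for a count of instances is introduced here for illustration and is not taken from Bramer:

$$
P(a = v \mid c) = \frac{N(a = v \wedge \text{class} = c)}{N(\text{class} = c)}
$$

For example, 9 of the 14 "on time" instances in the table fall on a weekday, so $P(\text{day} = \text{weekday} \mid \text{on time}) = 9/14 \approx 0.64$.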

In [28]:
def calculatePosteriors(data, attributeIndex, attribute, classIndex, classLabels):
    conditionalCounts = {}  # instances per class that have the attribute value
    counts = {}             # instances per class
    for label in classLabels:
        counts[label] = 0
        conditionalCounts[label] = 0
    for instance in data:
        counts[instance[classIndex]] += 1
        if instance[attributeIndex] == attribute:
            conditionalCounts[instance[classIndex]] += 1
    probabilities = []
    for label in classLabels:
        # A label with no training instances gets 0 instead of dividing by zero
        if counts[label] == 0:
            probabilities.append(0.0)
        else:
            probabilities.append(conditionalCounts[label] / counts[label])
    return probabilities

Once again, we will check our function output against the values provided by Bramer. For the attribute value day = "weekday", we expect the conditional probabilities given class = "on time", "late", "very late", and "cancelled" to be 0.64, 0.5, 1, and 0, respectively.

In [29]:
posteriors = calculatePosteriors(table, 0, "weekday", 4, ["on time", "late", "very late", "cancelled"])
print(posteriors)

[0.6428571428571429, 0.5, 1.0, 0.0]


Again, our values match up, although Bramer's values are rounded to 2 decimal places while ours are printed with more digits.
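
To make that comparison explicit, the computed values can be rounded to two decimal places before comparing. A minimal sketch, with the list copied from the output above:

```python
# Values printed by the cell above
computed = [0.6428571428571429, 0.5, 1.0, 0.0]
# Values quoted from Bramer, given to two decimal places
expected = [0.64, 0.5, 1, 0]

rounded = [round(p, 2) for p in computed]
print(rounded == expected)
```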

Finally, we can use this to create a Naive Bayes classifier function.

* `naiveBayesClassify()`
    * **Params**:
        * `train` - The data to use as training data
        * `classIndex` - The index of the class label
        * `classLabels` - A list of all possible class labels
        * `test` - The unseen data to classify
    * **Returns**:
        * A classification for the `test` instance
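
The rule this function applies can be written compactly. Writing $v_1, \dots, v_n$ for the attribute values of the test instance (notation introduced here for illustration), each class label $c$ receives a score equal to its prior times the conditional probability of each attribute value, and the label with the largest score is returned:

$$
\text{score}(c) = P(c) \prod_{i=1}^{n} P(a_i = v_i \mid c), \qquad \hat{c} = \arg\max_{c} \, \text{score}(c)
$$

The scores are not normalised, so they do not sum to 1; only their relative order matters for the classification.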

In [46]:
def naiveBayesClassify(train, classIndex, classLabels, test):
    # Start each class score at its prior probability
    classProbabilities = calculatePriors(train, classIndex, classLabels)
    for i in range(len(test)):
        # Skip the class-label slot of the test instance
        if i == classIndex:
            continue
        conditionals = calculatePosteriors(train, i, test[i], classIndex, classLabels)
        # Use a separate index so the attribute index i is not shadowed
        for j in range(len(conditionals)):
            classProbabilities[j] *= conditionals[j]
    # Pick the label with the largest score (the first label wins ties)
    maxP = 0
    index = 0
    for i in range(len(classProbabilities)):
        if classProbabilities[i] > maxP:
            maxP = classProbabilities[i]
            index = i
    return classLabels[index]

To test that our classifier is functioning as intended, we will run it on the train dataset using the unseen instance ("weekday", "winter", "high", "heavy", "???"). According to Bramer, this should be classified as "very late".

In [48]:
label = naiveBayesClassify(table, 4, ["on time", "late", "very late", "cancelled"], ["weekday", "winter", "high", "heavy", "???"])
print(label)

very late
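
The result can also be checked by hand. The sketch below is an illustration rather than part of the classifier: the counts are tallied manually from the table above, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction as F

# For each class: prior, then P(weekday|c), P(winter|c), P(high|c), P(heavy|c),
# with counts tallied by hand from the train table.
factors = {
    "on time":   [F(14, 20), F(9, 14), F(2, 14), F(4, 14), F(1, 14)],
    "late":      [F(2, 20),  F(1, 2),  F(2, 2),  F(1, 2),  F(1, 2)],
    "very late": [F(3, 20),  F(3, 3),  F(2, 3),  F(1, 3),  F(2, 3)],
    "cancelled": [F(1, 20),  F(0, 1),  F(0, 1),  F(1, 1),  F(1, 1)],
}

# Multiply the prior by each conditional probability to get the class score
scores = {}
for label, values in factors.items():
    score = F(1)
    for value in values:
        score *= value
    scores[label] = score

for label, score in scores.items():
    print(label, round(float(score), 4))

# The predicted label is the one with the largest score
best = max(scores, key=scores.get)
print(best)
```

With these counts the scores come out to roughly 0.0013, 0.0125, 0.0222, and 0, so "very late" has the largest score.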


As shown, the `naiveBayesClassify()` function classifies the unseen instance correctly, at least for this example. We now have a working method for applying Naive Bayes classification to a given dataset.