# PA5 - Naive Bayes

## Step 1 - Train Dataset

For this step, we will create a Naive Bayes classifier for the "train" dataset, provided below. 

In [79]:
table = [
    ["weekday", "spring", "none", "none", "on time"],
    ["weekday", "winter", "none", "slight", "on time"],
    ["weekday", "winter", "none", "slight", "on time"],
    ["weekday", "winter", "high", "heavy", "late"], 
    ["saturday", "summer", "normal", "none", "on time"],
    ["weekday", "autumn", "normal", "none", "very late"],
    ["holiday", "summer", "high", "slight", "on time"],
    ["sunday", "summer", "normal", "none", "on time"],
    ["weekday", "winter", "high", "heavy", "very late"],
    ["weekday", "summer", "none", "slight", "on time"],
    ["saturday", "spring", "high", "heavy", "cancelled"],
    ["weekday", "summer", "high", "slight", "on time"],
    ["saturday", "winter", "normal", "none", "late"],
    ["weekday", "summer", "high", "none", "on time"],
    ["weekday", "winter", "normal", "heavy", "very late"],
    ["saturday", "autumn", "high", "slight", "on time"],
    ["weekday", "autumn", "none", "heavy", "on time"],
    ["holiday", "spring", "normal", "slight", "on time"],
    ["weekday", "spring", "normal", "none", "on time"],
    ["weekday", "spring", "normal", "slight", "on time"]
]

To create this classifier, we first must calculate prior probabilities for each class label.
We will do this with the following helper function:

* `calculate_priors()`
    * **Params**:
        * `data` - The dataset to calculate prior probabilities for
        * `index` - The index of the classifier value in the dataset
        * `classes` - An array of the classes present in the dataset
    * **Returns**:
        * An array of prior probability values, ordered according to the `classes` param.

In [80]:
def calculate_priors(data, index, classes):
    counts = {}
    total = 0
    for label in classes:
        counts[label] = 0
    for instance in data:
        counts[instance[index]] += 1
        total += 1
    probabilities = []
    for label in classes:
        probabilities.append(counts[label] / total)
    return probabilities

To check these values, we will refer to **Figure 3.2** from the Bramer textbook, which claims that the Prior Probabilities for "on time", "late" "very late", and "cancelled" should be 0.70, 0.10, 0.15, and 0.05, respectively.

We will run our `calculatePriors()` function on these labels and display the values, which should match the given probabilities.

In [81]:
priors = calculate_priors(table, 4, ["on time", "late", "very late", "cancelled"])
print(priors)

[0.7, 0.1, 0.15, 0.05]


As we can see, the values returned from `calculate_priors()` are the same as given in Bramer. Next, we will calculate the posterior probabilities for a given class label. Again, we will create a helper function `calculate_posteriors()` to find these values:

* `calculate_posteriors()`
    * **Params**:
        * `data` - The dataset to calculate probabilities for
        * `attributeIndex` - The index of the attribute to calculate conditional probability for
        * `attribute` - The value of the index to calculate conditional probability for
        * `classIndex` - The index of the class label
        * `classLabels` - All classifier labels to calculate conditional probabilities for
    * **Returns**:
        * A list of posterior probabilities for the given attribute over each class label, ordered with respect to `classLabels()`

In [96]:
def calculate_posteriors(data, attributeIndex, attribute, classIndex, classLabels):
    conditionalCounts = {}
    counts = {}
    for label in classLabels:
        counts[label] = 0
        conditionalCounts[label] = 0
    for instance in data:
        counts[instance[classIndex]] += 1
        if instance[attributeIndex] == attribute:
            conditionalCounts[instance[classIndex]] += 1
    probabilities = []
    for label in classLabels:
        if counts[label] == 0:
            probabilities.append(0)
        else:
            probabilities.append(conditionalCounts[label] / counts[label])
    return probabilities

Once again, we will check our function output against the values provided by Bramer. For the attribute (day = "weekday"), we expect the posterior probabilities for class = "on time", "late", "very late", and "cancelled" to be 0.64, 0.5, 1, and 0, respectively.

In [97]:
posteriors = calculate_posteriors(table, 0, "weekday", 4, ["on time", "late", "very late", "cancelled"])
print(posteriors)

[0.6428571428571429, 0.5, 1.0, 0.0]


Again, our values match up, although Bramer's values round off to 2 decimal places while ours sometimes have more bits of precision.

Finally, we can use this to create a Naive Bayes classifier function.

* `naive_bayes_classify()`
    * **Params**:
        * `train` - The data to use as training data
        * `classIndex` - The index of the class label
        * `classLabels` - A list of all possible class labels
        * `test` - The unseen data to classify
    * **Returns**:
        * A classification for the `test` instance

In [98]:
def naive_bayes_classify(train, classIndex, classLabels, test):
    classProbabilities = []
    for val in classLabels:
        classProbabilities.append(0)
    priors = calculate_priors(train, classIndex, classLabels)
    for i in range(len(classProbabilities)):
        classProbabilities[i] += priors[i]
    for i in range(len(test)):
        if i == classIndex:
            continue
        else:
            attribute = test[i]
            posteriors = calculate_posteriors(train, i, attribute, classIndex, classLabels)
            for i in range(len(posteriors)):
                classProbabilities[i] *= posteriors[i]
    maxP = 0
    index = 0
    for i in range(len(classProbabilities)):
        if classProbabilities[i] > maxP:
            maxP = classProbabilities[i]
            index = i
    return classLabels[index]
            

To test that our classifier is functioning as intended, we will test it on the trains dataset using the unseen value ("weekday", "winter", "high", "heavy", "???"). Accordign to Bramer, this should be classified as "very late".

In [99]:
label = naive_bayes_classify(table, 4, ["on time", "late", "very late", "cancelled"], ["weekday", "winter", "high", "heavy", "???"])
print(label)

very late


And, as shown, the `naive_bayes_classify()` function does indeed classify the unseen data correctly. Therefore, we now have a functioning method for using Naive Bayes classification accross a given dataset.

## Step 2 - MPG predictor

For this step, we will use our Naive Bayes methods on the auo-data dataset. First, we will import the data into an array titled `auto_data`. To generate this data, we will reuse the functions `read_data()`, `create_dataset()`, and `resolve_missing()` from PA4.

In [100]:
def read_data(filename):
    f = open(filename, 'r')
    text = f.read()
    f.close()
    return text

def create_dataset(data):
    data_r = data.splitlines()
    dataset = []
    for line in data_r:
        instance = line.split(',')
        dataset.append(instance)
    for instance in dataset:
        for i in range(10):
            try:
                instance[i] = float(instance[i])
            except:
                instance[i] = instance[i]
    return dataset

def resolve_missing_values(data):
    for i in range(10):
        if i != 8:
            sum_i = 0
            count_i = 0
            for instance in data:
                if instance[i] != "NA":
                    try:
                        sum_i += instance[i]
                        count_i += 1
                    except:
                        print(instance[i])
            if count_i == 0:
                continue
            mean = sum_i / count_i
            for instance in data:
                if instance[i] == "NA":
                    instance[i] = mean

Then, we will use these functions to populate `auto_data`.

In [101]:
auto_data = create_dataset(read_data("auto-data.txt"))
resolve_missing_values(auto_data)

This step only cares about the cyliders, weight, and model year attributes, as well as mpg as a classifier. So, to clean the dataset, we will first go through and restrict it to only these values.

In [102]:
def clean_auto_data(data):
    cleaned_auto_data = []
    for instance in data:
        cleaned_instance = [instance[1], instance[4], instance[6], instance[0]]
        cleaned_auto_data.append(cleaned_instance)
    return cleaned_auto_data

auto_data = clean_auto_data(auto_data)

Then, we will go through and discretize mpg based on the DOE classification ranking, as well as weight based on the NHTSA vehicle sizes classification. Both tables are given below for reference.

| Rating | MPG   |
|--------|-----  |
|   10   | ≥ 45  |
|   9    | 37-44 |
|   8    | 31-36 |
|   7    | 27-30 |
|   6    | 24-26 |
|   5    | 20-23 |
|   4    | 17-19 |
|   3    | 15-16 |
|   2    |   14  |
|   1    | ≤ 13  |

| Ranking |  Weight   |
|---------|-----------|
|    5    | ≥ 3500    |
|    4    | 3000-3499 |
|    3    | 2500-2999 |
|    2    | 2000-2499 |
|    1    | ≤ 1999    |

In [103]:
def mpg_to_DOE(mpg):
    if mpg >= 45:
        y = 10
    elif mpg >= 37:
        y = 9
    elif mpg >= 31:
        y = 8
    elif mpg >= 27:
        y = 7
    elif mpg >= 24:
        y = 6
    elif mpg >= 20:
        y = 5
    elif mpg >= 17:
        y = 4
    elif mpg >= 15:
        y = 3
    elif mpg >= 14:
        y = 2
    else:
        y = 1
    return y

def weight_to_NHTSA(weight):
    if weight >= 3500:
        return 5
    elif weight >= 3000:
        return 4
    elif weight >= 2500:
        return 3
    elif weight >= 2000:
        return 2
    else:
        return 1

    
def discretize_auto_data(data):
    discrete_data = []
    for instance in data:
        discrete = [instance[0], weight_to_NHTSA(instance[1]), instance[2], mpg_to_DOE(instance[3])]
        discrete_data.append(discrete)
    return discrete_data

auto_data = discretize_auto_data(auto_data)

We will now test our classifier by repeating steps 2-5 from PA4.

### Random Instances

First, we will test our classifier on a subset of 5 random instances from the dataset. To do so, we must first generate 5 random instances to test. We will reuse `generate_random_instances()` from PA4.

In [111]:
from random import randint
import copy

def generate_random_instances(data, n):
    data_c = copy.deepcopy(data)
    test_instances = []
    for i in range(n):
        index = randint(0, len(data_c)-1)
        instance = data_c.pop(index)
        test_instances.append(instance)
    return test_instances

random_test = generate_random_instances(auto_data, 5)

Then, we will use our `naive_bayes_classify()` function to classify each test instance.

In [119]:
def generate_predictions(train, test):
    predictions = []
    for instance in test:
        predictions.append(naive_bayes_classify(train, 3, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], instance))
    return predictions

def print_output(test, predictions):
    for x in range(len(test)):
        print("instance: ", test[x][0], ", ", test[x][1], ", ", test[x][2], sep="")
        print("predicted: ", predictions[x], ", actual: ", test[x][3], sep="")
        print()

Now that we can generate a list of predictions, we will run the classifier and output the results.

In [120]:
predictions = generate_predictions(auto_data, random_test)
print_output(random_test, predictions)


instance: 4.0, 2, 77.0
predicted: 7, actual: 6

instance: 4.0, 3, 76.0
predicted: 6, actual: 6

instance: 6.0, 4, 74.0
predicted: 4, actual: 5

instance: 8.0, 4, 78.0
predicted: 4, actual: 4

instance: 8.0, 5, 73.0
predicted: 1, actual: 1



### Test / Train sets

For this test, we will use stratified cross validation to create a 2:1 train/test set. To generate the train and test sets, we will once again reuse functions from PA4: `create_random_subsample()`, `compute_accuracy()`, and `compute_error()`.

In [130]:
from random import shuffle

def create_random_subsample(data, size):
    cutoff = int(len(data) * size)
    data_c = copy.deepcopy(data)
    shuffle(data_c)
    train = data_c[:cutoff]
    test = data_c[cutoff + 1:]
    return test, train

def compute_accuracy(predictions, actual):
    correct = 0
    for i in range(len(predictions)):
        if predictions[i] == actual[i]:
            correct += 1
    accuracy = correct / len(predictions)
    return accuracy

def compute_error(predictions, actual):
    incorrect = 0
    for i in range(len(predictions)):
        if predictions[i] != actual[i]:
            incorrect += 1
    error = incorrect / len(predictions)
    return error

Then we can use these functions to generate predictions and print the accuracy and the error rate.

In [131]:
test, train = create_random_subsample(auto_data, 2/3)
predictions = generate_predictions(train, test)
accuracy = compute_accuracy(predictions, [x[3] for x in test])
error = compute_error(predictions, [x[3] for x in test])

print("Accuracy: ", accuracy)
print("Error Rate: ", error)

Accuracy:  0.4519230769230769
Error Rate:  0.5480769230769231


### Stratified Cross Validation

Finally, we will use Stratified 10-Fold Cross Validation to generate random subsamples of test / train data to run our classifier on. 

We will borrow the function `create_cross_fold()` from PA4 for doing the subsampling.

In [138]:
def create_cross_fold(data, n):
    data_r = copy.deepcopy(data)
    shuffle(data_r)
    size = int(len(data) * 1/n) 
    start = 0
    end = size
    folds = []
    for i in range(n-1):
        folds.append(data[start:end])
        start = end + 1
        end += size + 1
    folds.append(data[start:])
    return folds

We now need to run the classifier 10 times, using each subsequent fold as test data and the other 9 folds as training data. For each fold, we will compute an accuracy and error rate, and then display the average at the end.

In [139]:
def print_cross_fold_output(data, n):
    sum_accuracy = 0
    sum_error = 0
    folds = create_cross_fold(data, n)
    for i in range(n):
        test = folds[i]
        train = []
        for x in range(n):
            if x == i:
                continue
            train += folds[x]
        predictions = generate_predictions(train, test)
        sum_accuracy += compute_accuracy(predictions, [x[3] for x in test])
        sum_error += compute_error(predictions, [x[3] for x in test])
    accuracy = sum_accuracy / n
    error = sum_error / n
    print("Accuracy:", accuracy)
    print("Error Rate:", error)
    
print_cross_fold_output(auto_data, 10)

Accuracy: 0.3471923536439666
Error Rate: 0.6528076463560335
