# PA5 - Naive Bayes

## Step 1 - Train Dataset

For this step, we will create a Naive Bayes classifier for the "train" dataset, provided below. 

In [50]:
table = [
    ["weekday", "spring", "none", "none", "on time"],
    ["weekday", "winter", "none", "slight", "on time"],
    ["weekday", "winter", "none", "slight", "on time"],
    ["weekday", "winter", "high", "heavy", "late"], 
    ["saturday", "summer", "normal", "none", "on time"],
    ["weekday", "autumn", "normal", "none", "very late"],
    ["holiday", "summer", "high", "slight", "on time"],
    ["sunday", "summer", "normal", "none", "on time"],
    ["weekday", "winter", "high", "heavy", "very late"],
    ["weekday", "summer", "none", "slight", "on time"],
    ["saturday", "spring", "high", "heavy", "cancelled"],
    ["weekday", "summer", "high", "slight", "on time"],
    ["saturday", "winter", "normal", "none", "late"],
    ["weekday", "summer", "high", "none", "on time"],
    ["weekday", "winter", "normal", "heavy", "very late"],
    ["saturday", "autumn", "high", "slight", "on time"],
    ["weekday", "autumn", "none", "heavy", "on time"],
    ["holiday", "spring", "normal", "slight", "on time"],
    ["weekday", "spring", "normal", "none", "on time"],
    ["weekday", "spring", "normal", "slight", "on time"]
]

To create this classifier, we first must calculate prior probabilities for each class label.
We will do this with the following helper function:

* `calculate_priors()`
    * **Params**:
        * `data` - The dataset to calculate prior probabilities for
        * `index` - The index of the class label in each instance
        * `classes` - An array of the classes present in the dataset
    * **Returns**:
        * An array of prior probability values, ordered according to the `classes` param.

In [53]:
def calculate_priors(data, index, classes):
    counts = {}
    total = 0
    for label in classes:
        counts[label] = 0
    for instance in data:
        counts[instance[index]] += 1
        total += 1
    probabilities = []
    for label in classes:
        probabilities.append(counts[label] / total)
    return probabilities
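
The same relative-frequency calculation can be sketched more compactly with `collections.Counter`. This is only an illustrative alternative, not the helper used in the rest of this notebook. The name `priors_with_counter` and the short `labels` list (the class column of the train table, written out by count) are assumptions made for the example.

```python
from collections import Counter

def priors_with_counter(labels, classes):
    # Hypothetical helper: relative frequency of each class label.
    counts = Counter(labels)
    total = len(labels)
    # Counter returns 0 for labels that never occur.
    return [counts[label] / total for label in classes]

# Class column of the train table: 14 on time, 2 late, 3 very late, 1 cancelled.
labels = ["on time"] * 14 + ["late"] * 2 + ["very late"] * 3 + ["cancelled"]
classes = ["on time", "late", "very late", "cancelled"]
print(priors_with_counter(labels, classes))
```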

To check these values, we will refer to **Figure 3.2** from the Bramer textbook, which gives the prior probabilities for "on time", "late", "very late", and "cancelled" as 0.70, 0.10, 0.15, and 0.05, respectively.

We will run our `calculate_priors()` function on these labels and display the values, which should match the given probabilities.

In [54]:
priors = calculate_priors(table, 4, ["on time", "late", "very late", "cancelled"])
print(priors)

[0.7, 0.1, 0.15, 0.05]


As we can see, the values returned from `calculate_priors()` match those given in Bramer. Next, we will calculate what this notebook calls the posterior probabilities: strictly speaking, the conditional probabilities P(attribute = value | class) of an attribute value given each class label. Again, we will create a helper function, `calculate_posteriors()`, to find these values:

* `calculate_posteriors()`
    * **Params**:
        * `data` - The dataset to calculate probabilities for
        * `attributeIndex` - The index of the attribute to calculate conditional probability for
        * `attribute` - The attribute value to calculate conditional probability for
        * `classIndex` - The index of the class label
        * `classLabels` - All classifier labels to calculate conditional probabilities for
    * **Returns**:
        * A list of posterior probabilities for the given attribute value over each class label, ordered with respect to `classLabels`

In [55]:
def calculate_posteriors(data, attributeIndex, attribute, classIndex, classLabels):
    conditionalCounts = {}
    counts = {}
    for label in classLabels:
        counts[label] = 0
        conditionalCounts[label] = 0
    for instance in data:
        counts[instance[classIndex]] += 1
        if instance[attributeIndex] == attribute:
            conditionalCounts[instance[classIndex]] += 1
    probabilities = []
    for label in classLabels:
        probabilities.append(conditionalCounts[label] / counts[label])
    return probabilities

Once again, we will check our function output against the values provided by Bramer. For the attribute (day = "weekday"), we expect the posterior probabilities for class = "on time", "late", "very late", and "cancelled" to be 0.64, 0.5, 1, and 0, respectively.

In [56]:
posteriors = calculate_posteriors(table, 0, "weekday", 4, ["on time", "late", "very late", "cancelled"])
print(posteriors)

[0.6428571428571429, 0.5, 1.0, 0.0]


Again, our values match up, although Bramer's values are rounded to 2 decimal places while ours sometimes carry more digits of precision.
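
To see where the extra digits come from, the conditional probabilities can be written as exact fractions and then rounded. The sketch below is self-contained and uses counts tallied by hand from the train table (for example, 9 of the 14 "on time" instances fall on a weekday). The dictionary names are illustrative only.

```python
from fractions import Fraction

# Instances per class, and how many of those have day == "weekday".
class_counts = {"on time": 14, "late": 2, "very late": 3, "cancelled": 1}
weekday_counts = {"on time": 9, "late": 1, "very late": 3, "cancelled": 0}

# Exact conditional probability P(day = weekday | class) for each class.
exact = {
    label: Fraction(weekday_counts[label], class_counts[label])
    for label in class_counts
}

# Rounded to two decimal places, as printed in the textbook.
rounded = {label: round(float(p), 2) for label, p in exact.items()}

print(exact)
print(rounded)
```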

Finally, we can use this to create a Naive Bayes classifier function.

* `naive_bayes_classify()`
    * **Params**:
        * `train` - The data to use as training data
        * `classIndex` - The index of the class label
        * `classLabels` - A list of all possible class labels
        * `test` - The unseen data to classify
    * **Returns**:
        * A classification for the `test` instance

In [58]:
def naive_bayes_classify(train, classIndex, classLabels, test):
    # Start each class score at its prior probability.
    classProbabilities = calculate_priors(train, classIndex, classLabels)
    for i in range(len(test)):
        # Skip the class column; it is what we are predicting.
        if i == classIndex:
            continue
        posteriors = calculate_posteriors(train, i, test[i], classIndex, classLabels)
        # Use a separate index (j) so the attribute index (i) is not shadowed.
        for j in range(len(posteriors)):
            classProbabilities[j] *= posteriors[j]
    # Return the label with the highest score (the first label wins ties).
    maxP = 0
    index = 0
    for i in range(len(classProbabilities)):
        if classProbabilities[i] > maxP:
            maxP = classProbabilities[i]
            index = i
    return classLabels[index]

To test that our classifier is functioning as intended, we will run it on the train dataset using the unseen instance ("weekday", "winter", "high", "heavy", "???"). According to Bramer, this should be classified as "very late".

In [59]:
label = naive_bayes_classify(table, 4, ["on time", "late", "very late", "cancelled"], ["weekday", "winter", "high", "heavy", "???"])
print(label)

very late


As shown, the `naive_bayes_classify()` function does indeed classify the unseen instance correctly. We therefore have a working method for applying Naive Bayes classification to a given dataset.
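
To make the decision concrete, the sketch below recomputes the score of each class for the instance ("weekday", "winter", "high", "heavy") as the prior multiplied by the four conditional probabilities. The counts are tallied by hand from the train table, and the names `class_counts` and `match_counts` are hypothetical, chosen only for this example.

```python
from fractions import Fraction

# Number of training instances per class (20 in total).
class_counts = {"on time": 14, "late": 2, "very late": 3, "cancelled": 1}

# Per class: how many instances have day=weekday, season=winter,
# wind=high, and rain=heavy, in that order.
match_counts = {
    "on time": [9, 2, 4, 1],
    "late": [1, 2, 1, 1],
    "very late": [3, 2, 1, 2],
    "cancelled": [0, 0, 1, 1],
}

total = sum(class_counts.values())
scores = {}
for label, n in class_counts.items():
    score = Fraction(n, total)  # prior probability of the class
    for count in match_counts[label]:
        score *= Fraction(count, n)  # conditional probability of one attribute value
    scores[label] = score

for label, score in scores.items():
    print(f"{label}: {float(score):.4f}")

# The predicted label is the one with the largest score.
best = max(scores, key=scores.get)
print(best)
```

The resulting scores are roughly 0.0013, 0.0125, 0.0222, and 0, so "very late" has the largest score.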

## Step 2 - MPG predictor

For this step, we will use our Naive Bayes methods on the auto-data dataset. First, we will import the data into an array named `auto_data`. To generate this data, we will reuse the functions `read_data()`, `create_dataset()`, and `resolve_missing_values()` from PA4.

In [69]:
def read_data(filename):
    # Read the whole file; the with-block closes it even if reading fails.
    with open(filename, 'r') as f:
        return f.read()

def create_dataset(data):
    dataset = []
    for line in data.splitlines():
        dataset.append(line.split(','))
    for instance in dataset:
        for i in range(10):
            try:
                instance[i] = float(instance[i])
            except ValueError:
                # Leave non-numeric fields (such as "NA") as strings.
                pass
    return dataset

def resolve_missing_values(data):
    # Replace "NA" entries with the column mean, for every column except 8.
    for i in range(10):
        if i != 8:
            sum_i = 0
            count_i = 0
            for instance in data:
                if instance[i] != "NA":
                    try:
                        sum_i += instance[i]
                        count_i += 1
                    except TypeError:
                        # Report values that are neither numeric nor "NA".
                        print(instance[i])
            if count_i == 0:
                continue
            mean = sum_i / count_i
            for instance in data:
                if instance[i] == "NA":
                    instance[i] = mean
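
The mean-imputation idea can be seen in isolation on a tiny example. The sketch below is a simplified, single-column version under stated assumptions: the helper name `impute_column_mean` is hypothetical, and the rows are made up rather than taken from `auto-data.txt`.

```python
def impute_column_mean(rows, col, missing="NA"):
    # Hypothetical helper: replace missing entries in one column with
    # the mean of the values that are present.
    present = [row[col] for row in rows if row[col] != missing]
    if not present:
        return  # nothing to average, so leave the column unchanged
    mean = sum(present) / len(present)
    for row in rows:
        if row[col] == missing:
            row[col] = mean

# Made-up rows, not taken from auto-data.txt.
rows = [[18.0, 8.0], ["NA", 4.0], [22.0, 4.0]]
impute_column_mean(rows, 0)
print(rows)
```

Here the missing entry in column 0 becomes 20.0, the mean of 18.0 and 22.0.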

Then, we will use these functions to populate `auto_data`.

In [70]:
auto_data = create_dataset(read_data("auto-data.txt"))
resolve_missing_values(auto_data)

This step only uses the cylinders, weight, and model year attributes, with mpg as the class label. So, to clean the dataset, we will first restrict it to only these values.

In [71]:
def clean_auto_data(data):
    cleaned_auto_data = []
    for instance in data:
        cleaned_instance = [instance[1], instance[4], instance[6], instance[0]]
        cleaned_auto_data.append(cleaned_instance)
    return cleaned_auto_data

auto_data = clean_auto_data(auto_data)

Then, we will go through and discretize mpg based on the DOE classification ranking, as well as weight based on the NHTSA vehicle sizes classification. Both tables are given below for reference.

| Rating | MPG   |
|--------|-----  |
|   10   | ≥ 45  |
|   9    | 37-44 |
|   8    | 31-36 |
|   7    | 27-30 |
|   6    | 24-26 |
|   5    | 20-23 |
|   4    | 17-19 |
|   3    | 15-16 |
|   2    |   14  |
|   1    | ≤ 13  |

| Ranking |  Weight   |
|---------|-----------|
|    5    | ≥ 3500    |
|    4    | 3000-3499 |
|    3    | 2500-2999 |
|    2    | 2000-2499 |
|    1    | ≤ 1999    |

In [None]:
def mpg_to_DOE(mpg):
    if mpg >= 45:
        y = 10
    elif mpg >= 37:
        y = 9
    elif mpg >= 31:
        y = 8
    elif mpg >= 27:
        y = 7
    elif mpg >= 24:
        y = 6
    elif mpg >= 20:
        y = 5
    elif mpg >= 17:
        y = 4
    elif mpg >= 15:
        y = 3
    elif mpg >= 14:
        y = 2
    else:
        y = 1
    return y

def weight_to_NHTSA(weight):
    if weight >= 3500:
        return 5
    elif weight >= 3000:
        return 4
    elif weight >= 2500:
        return 3
    elif weight >= 2000:
        return 2
    else:
        return 1

    
def discretize_auto_data(data):
    discrete_data = []
    for instance in data:
        discrete = [instance[0], weight_to_NHTSA(instance[1]), instance[2], mpg_to_DOE(instance[3])]
        discrete_data.append(discrete)
    return discrete_data
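
As an alternative to the `if`/`elif` chains, the same binning could be driven by a list of cutoffs and `bisect.bisect_right` from the standard library. This is a sketch only: the names `MPG_CUTOFFS`, `WEIGHT_CUTOFFS`, and `bin_value` are hypothetical, and the cutoffs are the lower bounds read off the two tables above.

```python
from bisect import bisect_right

# Lower bounds of DOE ratings 2..10 and NHTSA rankings 2..5.
MPG_CUTOFFS = [14, 15, 17, 20, 24, 27, 31, 37, 45]
WEIGHT_CUTOFFS = [2000, 2500, 3000, 3500]

def bin_value(value, cutoffs):
    # Count how many cutoffs are <= value; bins are numbered from 1.
    return bisect_right(cutoffs, value) + 1

for mpg in (13.0, 14.0, 26.5, 45.0):
    print(mpg, bin_value(mpg, MPG_CUTOFFS))
for weight in (1999, 2000, 3504):
    print(weight, bin_value(weight, WEIGHT_CUTOFFS))
```

Keeping the cutoffs in one list makes it easier to compare the code against the tables, at the cost of being slightly less explicit than the chained comparisons.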