### The University of Melbourne, School of Computing and Information Systems
# COMP30027 Machine Learning, 2022 Semester 1

## Assignment 1: Naive Bayes Leaner for Adult Database


**Student Name(s):** `Xi Chen, Yu Cao`
<br>
**Student ID(s):** `1213849, 1043108`



Marking will be applied on the four functions that are defined in this notebook, and to your responses to the questions at the end of this notebook.

## General info

<b>Due date</b>: Friday, 8 April 2022 7pm

<b>Submission method</b>: Canvas submission

<b>Submission materials</b>: This iPython notebook is a template which you will use for your Assignment 1 submission. You need to only submitted the completed copy of this iPython notebook.

<b>Late submissions</b>: -10% per day up to 5 days (both weekdays and weekends count). Submissions more than 5 days late will not be accepted (resul in a mark of 0).
<ul>
    <li>one day late, -1.0;</li>
    <li>two days late, -2.0;</li>
    <li>three days late, -3.0;</li>
    <li>four days late, -4.0;</li>
    <li>five days late, -5.0;</li>
</ul>

<b>Extensions</b>: Students who are demonstrably unable to submit a full solution in time due to medical reasons or other trauma, may apply for an extension.  In these cases, you should email <a href="mailto:ni.ding@unimelb.edu.au">Ni Ding</a> as soon as possible after those circumstances arise. If you attend a GP or other health care service as a result of illness, be sure to provide a Health Professional Report (HPR) form (get it from the Special Consideration section of the Student Portal), you will need this form to be filled out if your illness develops into something that later requires a Special Consideration application to be lodged. You should scan the HPR form and send it with the extension requests.

<b>Marks</b>: This assignment will be marked out of 20, and make up 20% of your overall mark for this subject.

<b>Materials</b>: Use Jupyter Notebook and Python page on Canvas for information on the basic setup required for this class, including an iPython notebook viewer and the python packages NLTK, Numpy, Scipy, Matplotlib, Scikit-Learn. You can use any Python built-in packages, but do not use any other 3rd party packages; if your iPython notebook doesn't run on the marker's machine, you will lose marks. <b> You should use Python 3</b>.  


<b>Evaluation</b>: Your iPython notebook should run end-to-end without any errors in a reasonable amount of time, and you must follow all instructions provided below, including specific implementation requirements and instructions for what needs to be printed (please avoid printing output we don't ask for). You should edit the sections below where requested, but leave the rest of the code as is. You should leave the output from running your code in the iPython notebook you submit, to assist with marking. The amount each section is worth is given in parenthesis after the instructions. 

You will be marked not only on the correctness of your methods, but also the quality and efficency of your code: in particular, you should be careful to use Python built-in functions and operators when appropriate and pick descriptive variable names that adhere to <a href="https://www.python.org/dev/peps/pep-0008/">Python style requirements</a>. If you think it might be unclear what you are doing, you should comment your code to help the marker make sense of it. We reserve the right to deduct up to 2 marks for unreadable or exessively inefficient code.

8 of the marks available for this Project will be assigned to whether the four specified Python functions work in a manner consistent with the materials from COMP30027. Any other implementation will not be directly assessed (except insofar as it is required to make these five functions work correctly).

12 of the marks will be assigned to your responses to the questions, in terms of both accuracy and insightfulness. We will be looking for evidence that you have an implementation that allows you to explore the problem, but also that you have thought deeply about the data and the behaviour of the Naive Bayes classifier.

<b>Updates</b>: Any major changes to the assignment will be announced via Canvas. Minor changes and clarifications will be announced on the discussion board (ED -> Assignments -> A1); we recommend you check it regularly.

<b>Academic misconduct</b>: For most people, collaboration will form a natural part of the undertaking of this homework, and we encourge you to discuss it in general terms with other students. However, this ultimately is still an individual task, and so reuse of code or other instances of clear influence will be considered cheating. Please check the <a href="https://canvas.lms.unimelb.edu.au/courses/124196/modules#module_662096">CIS Academic Honesty training</a> for more information. We will be checking submissions for originality and will invoke the University’s <a href="http://academichonesty.unimelb.edu.au/policy.html">Academic Misconduct policy</a> where inappropriate levels of collusion or plagiarism are deemed to have taken place.

**IMPORTANT**

Please carefully read and fill out the <b>Authorship Declaration</b> form at the bottom of the page. Failure to fill out this form results in the following deductions: 
<UL TYPE=”square”>
<LI>missing Authorship Declaration at the bottom of the page, -5.0
<LI>incomplete or unsigned Authorship Declaration at the bottom of the page, -3.0
</UL>
**NOTE: COMPLETE AND SUBMIT THIS FILE. YOU SHOULD IMPLEMENT FOUR FUNCTIONS AND INCLUDE YOUR ANSWERS TO THE QUESTIONS IN THIS FILE ONLY. NO OTHER SUBMISSION IS REQUIRED.**

**Keep your code clean. Adding proper comments to your code is MANDATORY.**

## Part 1: Base code [8 marks]

Instructions
1. Do **not** shuffle the data set
2. Treat the attributes as they are(e.g., do **not** convert numeric attributes to categorical or categorical to numeric). Implement a Naive Bayes classifier with appropriate likelihood function for each attribute.
3. You should implement the Naive Bayes classifier from scratch. Do **not** use existing implementations/learning algorithms.
4. You CANNOT have more than one train or predict function. Both continuous numeric attributes and categorical ones should be trained in one `train()` function, similarly for the `predict()`.  
5. Apart from the instructions in point 3, you may use libraries to help you with data reading, representation, maths or evaluation
6. Ensure that all and only required information is printed, as indicated in the final three code cells. Failure to adhere to print the required information will result in **[-1 mark]** per case. *(We don't mind details like you print a list or several numbers -- just make sure the information is displayed so that it's easily accessible)
7. You may change the prototypes of these functions, and you may write other functions, according to your requirements. We would appreciate it if the required functions were prominent/easy to find. 
8. You should add adequate comments to make your code easily comprehendible.*

In [29]:
# Cell for import only
import pandas as pd
from sklearn.model_selection import train_test_split
from collections import defaultdict
import math
import numpy as np
from sklearn.model_selection import cross_val_score

In [2]:
# This function should prepare the data by reading it from a file and converting it into a useful format for training and testing
# and implement 90-10 splitting as specified in the project description.
def preprocess(filename):
    data = pd.read_csv(filename)
    y = data['label']
    X = data.iloc[:, :-1]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, shuffle=False)

    X_test.index -= 900
    y_test.index -= 900

    return X_train, X_test, y_train, y_test

In [3]:
# This function should calculat prior probabilities and likelihoods (conditional probabilities) from the training data and using
# to build a naive Bayes model

def train(X, y):
    # Record frequency for each cj
    cj_freq = defaultdict(float)
    for i in range(len(y)):
        cj_freq[y[i]] += 1
    # Calculate the total appearence
    total_appearence = 0
    for key in cj_freq.keys():
        total_appearence += cj_freq[key]
    # Calculate the probability for each cj
    for key in cj_freq.keys():
        cj_freq[key] = cj_freq[key] / total_appearence

    # Create a dictionary for every type of feature
    feature_dict = {}
    for key in cj_freq.keys():
        feature_dict[key] = {}
        for feature in X.keys():
            if ((not feature == "age") and (not feature == "education num") and \
                (not feature == "hours per week")): # Skip the numeric attributes
                feature_dict[key][feature] = defaultdict(float)
        feature_dict[key]["age"] = {"mean": 0, "sd": 0}
        feature_dict[key]["education num"] = {"mean": 0, "sd": 0}
        feature_dict[key]["hours per week"] = {"mean": 0, "sd": 0}

    # Record frequency
    for key in X.keys():
        for i in range(len(X[key])):
            if ((not key == "age") and (not key == "education num") and \
                (not key == "hours per week")): # Skip the numeric attributes
                feature_dict[y[i]][key][X[key][i]] += 1

    # Calculate conditional probability for each type of feature
    for key in feature_dict.keys():
        for feature in feature_dict[key]:
            if ((not feature == "age") and (not feature == "education num") and \
                (not feature == "hours per week")):
                sum = 0
                for type in feature_dict[key][feature]:
                    sum += feature_dict[key][feature][type]
                for type in feature_dict[key][feature]:
                    feature_dict[key][feature][type] /= sum

    # Calculate the mean for numeric features
    for key in feature_dict.keys():
        for i in range(len(X)):
            if y[i] == key:
                feature_dict[key]["age"]["mean"] += X["age"][i]
                feature_dict[key]["education num"]["mean"] += X["education num"][i]
                feature_dict[key]["hours per week"]["mean"] += X["education num"][i]
        feature_dict[key]["age"]["mean"] /= cj_freq[key] * total_appearence
        feature_dict[key]["education num"]["mean"] /= cj_freq[key] * total_appearence
        feature_dict[key]["hours per week"]["mean"] /= cj_freq[key] * total_appearence

    # Calculate the standard deviation for numeric features
    for key in feature_dict.keys():
        for i in range(len(X)):
            if y[i] == key:
                feature_dict[key]["age"]["sd"] += (X["age"][i] - feature_dict[key]["age"]["mean"]) ** 2
                feature_dict[key]["education num"]["sd"] += (X["age"][i] - feature_dict[key]["education num"]["mean"]) ** 2
                feature_dict[key]["hours per week"]["sd"] += (X["age"][i] - feature_dict[key]["hours per week"]["mean"]) ** 2
        feature_dict[key]["age"]["sd"] = math.sqrt(feature_dict[key]["age"]["sd"] / (cj_freq[key] * total_appearence - 1))
        feature_dict[key]["education num"]["sd"] = math.sqrt(feature_dict[key]["education num"]["sd"] / (cj_freq[key] * total_appearence - 1))
        feature_dict[key]["hours per week"]["sd"] = math.sqrt(feature_dict[key]["hours per week"]["sd"] / (cj_freq[key] * total_appearence - 1))

    return {"cj_freq": cj_freq, "feature_dict": feature_dict}

In [4]:
# This function should calculat prior probabilities and likelihoods (conditional probabilities) from the training data and using
# to build a naive Bayes model

def train_kde(X, y):
    # Record frequency for each cj
    cj_freq = defaultdict(float)
    for i in range(len(y)):
        cj_freq[y[i]] += 1
    # Calculate the total appearence
    total_appearence = 0
    for key in cj_freq.keys():
        total_appearence += cj_freq[key]
    # Calculate the probability for each cj
    for key in cj_freq.keys():
        cj_freq[key] = cj_freq[key] / total_appearence

    # Create a dictionary for every type of feature
    feature_dict = {}
    for key in cj_freq.keys():
        feature_dict[key] = {}
        for feature in X.keys():
            if ((not feature == "age") and (not feature == "education num") and \
                (not feature == "hours per week")): # Skip the numeric attributes
                feature_dict[key][feature] = defaultdict(float)
        feature_dict[key]["age"] = []
        feature_dict[key]["education num"] = []
        feature_dict[key]["hours per week"] = []

    # Record frequency
    for key in X.keys():
        for i in range(len(X[key])):
            if ((not key == "age") and (not key == "education num") and \
                (not key == "hours per week")): # Skip the numeric attributes
                feature_dict[y[i]][key][X[key][i]] += 1

    # Calculate conditional probability for each type of feature
    for key in feature_dict.keys():
        for feature in feature_dict[key]:
            if ((not feature == "age") and (not feature == "education num") and \
                (not feature == "hours per week")):
                sum = 0
                for type in feature_dict[key][feature]:
                    sum += feature_dict[key][feature][type]
                for type in feature_dict[key][feature]:
                    feature_dict[key][feature][type] /= sum

    # Record all numeric values
    for key in feature_dict.keys():
        for i in range(len(X)):
            if y[i] == key:
                feature_dict[key]["age"].append(X["age"][i])
                feature_dict[key]["education num"].append(X["education num"][i])
                feature_dict[key]["hours per week"].append(X["hours per week"][i])

    return {"cj_freq": cj_freq, "feature_dict": feature_dict}

In [5]:
# This function should predict classes for new items in the testing data
def predict(GNB_model, X_test):
    y_predict = []
    for i in range(len(X_test)):
        predict_dict = {}
        for key in GNB_model["cj_freq"].keys():
            predict_dict[key] = []
        # Record probability for every feature
        for feature in X_test.iloc[i].keys():
            if feature == "age" or feature == "education num" or feature == "hours per week":
                for key in GNB_model["cj_freq"].keys():
                    sd = GNB_model["feature_dict"][key][feature]["sd"]
                    mean = GNB_model["feature_dict"][key][feature]["mean"]
                    x = X_test.iloc[i][feature]
                    # Calculate probability of numeric features using the formula
                    predict_dict[key].append((1 / (sd * math.sqrt(2 * math.pi))) * math.e ** \
                        (-0.5 * (((x - mean) / sd) ** 2)))
            else:
                for key in GNB_model["cj_freq"].keys():
                    predict_dict[key].append(GNB_model["feature_dict"][key][feature][X_test.iloc[i][feature]] + 0.00000000000000001) # Add extremely small number to prevent log(0)
        
        # Probability comparison
        less_50_prob = np.log(GNB_model["cj_freq"][" <=50K"]) + sum(np.log(predict_dict[" <=50K"]))
        more_50_prob = np.log(GNB_model["cj_freq"][" >50K"]) + sum(np.log(predict_dict[" >50K"]))
        
        if less_50_prob > more_50_prob:
            y_predict.append((" <=50K", less_50_prob, more_50_prob))
        else:
            y_predict.append((" >50K", less_50_prob, more_50_prob))

    return y_predict

In [17]:
# This function should predict classes for new items in the testing data
def predict_kde(GNB_model, X_test, bandwidth):
    y_predict = []
    for i in range(len(X_test)):
        predict_dict = {}
        for key in GNB_model["cj_freq"].keys():
            predict_dict[key] = []
        # Record probability for every feature
        for feature in X_test.iloc[i].keys():
            if feature == "age" or feature == "education num" or feature == "hours per week":
                for key in GNB_model["cj_freq"].keys():
                    n = len(GNB_model["feature_dict"][key][feature])
                    x = X_test.iloc[i][feature]
                    result = 0
                    for xi in GNB_model["feature_dict"][key][feature]:
                        result += (1/(bandwidth * math.sqrt(2 * math.pi))) * math.e ** (-0.5 * (((x-GNB_model["feature_dict"][key][feature][xi]) / bandwidth) ** 2))
                    result /= n
                    predict_dict[key].append(result)

            else:
                for key in GNB_model["cj_freq"].keys():
                    predict_dict[key].append(GNB_model["feature_dict"][key][feature][X_test.iloc[i][feature]] + 0.00000000000000001) # Add extremely small number to prevent log(0)
        
        # Probability comparison
        less_50_prob = np.log(GNB_model["cj_freq"][" <=50K"]) + sum(np.log(predict_dict[" <=50K"]))
        more_50_prob = np.log(GNB_model["cj_freq"][" >50K"]) + sum(np.log(predict_dict[" >50K"]))
        
        if less_50_prob > more_50_prob:
            y_predict.append((" <=50K", less_50_prob, more_50_prob))
        else:
            y_predict.append((" >50K", less_50_prob, more_50_prob))

    return y_predict

In [7]:
# This function should evaliate the prediction performance by comparing your model’s class outputs to ground
# truth labels, return and output accuracy, confusion matrix and F1 score.

def evaluate(y_predict, y_test):
    tp = 0
    tn = 0
    fp = 0
    fn = 0
    for i in range(len(y_predict)):
        if y_predict[i][0] == " <=50K":
            if y_test[i] == y_predict[i][0]:
                tp += 1
            else:
                fp += 1
        else:
            if y_test[i] == y_predict[i][0]:
                tn += 1
            else:
                fn += 1
    
    accuracy = (tp+tn) / (tp+fp+fn+tn)
    precision = tp / (tp+fp)
    recall = tp / (tp+fn)
    f1_score = (2*precision*recall) / (precision+recall)
    confusion_matrix = pd.DataFrame([[tp, fn], [fp, tn]])
    
    result_dict = {"accuracy": accuracy, "matrix": confusion_matrix, "f1": f1_score}
    return result_dict

In [20]:
# This cell should act as your "main" function where you call the above functions 
# on the full ADULT data set, and print the evaluation results. [0.33 marks]



# First, read in the data and apply your NB model to the ADULT data
data = preprocess("./dataset/adult.csv")
X_train = data[0]
X_test = data[1]
y_train = data[2]
y_test = data[3]

model = train(X_train, y_train)

y_predict = predict(model, X_test)





# Second, print the full evaluation results from the evaluate() function
result = evaluate(y_predict, y_test)





# Third, print data statistics and model predictions, as instructed below 
# N is the total number of instances, F the total number of attributes, L the total number of labels
# The "class probabilities" may be unnormalized
# The "predicted class ID" must be in range (0, L)

print("Attribute vectors of instances [0, 1, 2]: ", pd.read_csv("./dataset/adult.csv").head(3)) # of the first three records in adult.csv

print("\nNumber of instances (N): ", len(X_train) + len(X_test))
print("Number of attributes (F): ", len(X_train.keys()))
print("Number of labels (L): ", len(model["cj_freq"].keys()))


# print out the prediction results of the last three instances
print("\n\nPredicted class log-probabilities for instance N-3: ", "<=50K:", y_predict[-3][1], ">50K:", y_predict[-3][2])
print("Predicted class ID for instance N-3: ", y_predict[-3][0])
print("\nPredicted class log-probabilities for instance N-2: ", "<=50K:", y_predict[-2][1], ">50K:", y_predict[-2][2])
print("Predicted class ID for instance N-2: ", y_predict[-2][0])
print("\nPredicted class log-probabilities for instance N-1: ", "<=50K:", y_predict[-1][1], ">50K:", y_predict[-1][2])
print("Predicted class ID for instance N-1: ", y_predict[-1][0])

result

Attribute vectors of instances [0, 1, 2]:     age         work class   education  education num       marital status  \
0   68                  ?     1st-4th              2             Divorced   
1   39          State-gov   Bachelors             13        Never-married   
2   50   Self-emp-not-inc   Bachelors             13   Married-civ-spouse   

         occupation    relationship    race      sex  hours per week  \
0                 ?   Not-in-family   White   Female              20   
1      Adm-clerical   Not-in-family   White     Male              40   
2   Exec-managerial         Husband   White     Male              13   

  native country (region)   label  
0           United-States   <=50K  
1           United-States   <=50K  
2           United-States   <=50K  

Number of instances (N):  1000
Number of attributes (F):  11
Number of labels (L):  2


Predicted class log-probabilities for instance N-3:  <=50K: -23.064978751063983 >50K: -23.424255029245717
Predicted class ID f

{'accuracy': 0.83,
 'matrix':     0   1
 0  64  13
 1   4  19,
 'f1': 0.8827586206896552}

In [27]:
# This cell should act as your "main" function where you call the above functions 
# on the full ADULT data set, and print the evaluation results. [0.33 marks]



# First, read in the data and apply your NB model to the ADULT data
data = preprocess("./dataset/adult.csv")
X_train = data[0]
X_test = data[1]
y_train = data[2]
y_test = data[3]

model = train_kde(X_train, y_train)

y_predict = predict_kde(model, X_test, 15)





# Second, print the full evaluation results from the evaluate() function
result = evaluate(y_predict, y_test)





# Third, print data statistics and model predictions, as instructed below 
# N is the total number of instances, F the total number of attributes, L the total number of labels
# The "class probabilities" may be unnormalized
# The "predicted class ID" must be in range (0, L)

print("Attribute vectors of instances [0, 1, 2]: ", pd.read_csv("./dataset/adult.csv").head(3)) # of the first three records in adult.csv

print("\nNumber of instances (N): ", len(X_train) + len(X_test))
print("Number of attributes (F): ", len(X_train.keys()))
print("Number of labels (L): ", len(model["cj_freq"].keys()))


# print out the prediction results of the last three instances
print("\n\nPredicted class log-probabilities for instance N-3: ", "<=50K:", y_predict[-3][1], ">50K:", y_predict[-3][2])
print("Predicted class ID for instance N-3: ", y_predict[-3][0])
print("\nPredicted class log-probabilities for instance N-2: ", "<=50K:", y_predict[-2][1], ">50K:", y_predict[-2][2])
print("Predicted class ID for instance N-2: ", y_predict[-2][0])
print("\nPredicted class log-probabilities for instance N-1: ", "<=50K:", y_predict[-1][1], ">50K:", y_predict[-1][2])
print("Predicted class ID for instance N-1: ", y_predict[-1][0])

result

Attribute vectors of instances [0, 1, 2]:     age         work class   education  education num       marital status  \
0   68                  ?     1st-4th              2             Divorced   
1   39          State-gov   Bachelors             13        Never-married   
2   50   Self-emp-not-inc   Bachelors             13   Married-civ-spouse   

         occupation    relationship    race      sex  hours per week  \
0                 ?   Not-in-family   White   Female              20   
1      Adm-clerical   Not-in-family   White     Male              40   
2   Exec-managerial         Husband   White     Male              13   

  native country (region)   label  
0           United-States   <=50K  
1           United-States   <=50K  
2           United-States   <=50K  

Number of instances (N):  1000
Number of attributes (F):  11
Number of labels (L):  2


Predicted class log-probabilities for instance N-3:  <=50K: -21.682123653884247 >50K: -22.038114578906434
Predicted class ID f

{'accuracy': 0.81,
 'matrix':     0   1
 0  62  15
 1   4  19,
 'f1': 0.8671328671328672}

## Part 2: Conceptual questions [8 marks for groups of 1] / [16 marks for groups of 2]


If you are in a group of 1, you should respond to Q1 and Q2.

If you are in a group of 2, you should respond to Q1, Q2, Q3 and Q4.

A response to a question should take about 100–250 words. You may need to develope codes or functions to help respond to the question here. 

#### NOTE: We strongly recommend <u>including figures or tables, etc.</u> to support your responses. The figures and tables inserted in Markdown cells must be reproducable by your code.

### Q1 [4 marks]
<u>Sensitivity</u> and <u>specificity</u> are two model evaluation metrics.  A good model should have both sensitivity and specificity high. Use the $2 \times 2$ confusion matrix returned by `evaluate()` to calculate the sensitivity and specificity. Do you see a difference between them? If so, what causes this difference? Provide suggestions to improve the model performance. 

In [9]:
# Write additional code here, if necessary (you may insert additional code cells)

# We know that the formula of calculating Sensitivity is TP / (TP + FN), and we have the confusion matrix from evaluate() function specified that 
# TP = 64, FP = 4, FN = 13, and TN = 19
# SO that we can know the Sensitivity value is 64 / (64 + 13) = 0.831

# We also know the formula of calculating Specificity is TN / (TN + FP), so that we have Specificity = 19 / (19 + 4) = 0.826
# Then we can conclude that the Sensitivity or True positive rate is 0.831, and the Specificity or True negative rate is 0.826

# A highly sensitive test means that there are few false negative results, 
# and a highly specific test means that there are few false positive results. 

# From these two data, we can find out that there are more Type One Error than Type Two Error since 0.826 < 0.831, and that's the reason why they are different.
# However, the sum of them are between 1.5 to 2.0, which means the prediction is much more reliable and useful.
# It's a hard way to balance the Sensitivity and the Specificity, we can use a receiver operating characteristic (ROC) curve to visualize
# both Sensitivity and Specificity in order to find a balance point to maximize the values and minimize the Type Errors.

Provide your text answer of 150-200 words in this cell.

### Q2 [4 marks]
You can adopt different methods for training and/or testing, which will produce different results in model evaluation. 

(a) Instead of Gaussian, <u>implement KDE</u> for  $P(X_i|c_j)$ for numeric attributes $X_i$. Compare the evaluation results with Gaussian. Which one do you think is more suitable to model $P(X_i|c_j)$, Gaussian or KDE? Observe all numeric attributes and justify your answer.

You can choose an arbitrary value for kernel bandwidth $\sigma$ for KDE, but a value between 3 and 15 is recommended. You should write code to implement KDE, not call an existing function/method such as `KernelDensity` from `scikit-learn`.

(b) Implement <u>10-fold and 2-fold cross-validations</u>.  
	Observe the evaluation results in each fold and the average accuracy, recall and specificity over all folds. 
	Comment on what is the effect by changing the values of $m$ in $m$-fold cross validation. (You can choose either Gaussian or KDE Naive Bayes.)

In [10]:
# Write additional code here, if necessary (you may insert additional code cells)

# For the calculation of KDE, we have constructed a predict_kde function as above, and we did a comparison between the results of Gaussian and KDE.

# We know that the Gaussian distribution or we call it normal distribution is the method that we used to compute the probability density function for numerical attributes.
# By Central Limit Theroem, we are more likely to use normal distribution when the sample size is large, then that's the first reason why the Gaussian distribution is more suitable than KDE
# since KDE captures more arbitrary parameters.

# After we did the predictions for both methods, we can know how accurate and precise of our outputs by analyzing the confusion matrices.
# The first code cell gives us the result of using Gaussian distribution, we have the accuracy = 0.83, and the F1 score = 0.883.
# And the following cell shows that the accuracy = 0.81, and the F1 score = 0.867 from KDE method.

# We can easily find that the prediction result of using Gaussian method is more accurate and reliable than KDE, which means Gaussian is more suitable for out dataset.

# When we doing the KDE coding, we notice that the kernel bandwidth need to be determined fro 3 to 15 by ourseleves, and the result will stay the same as the kernel bandwidth become larger than 15.
# It's obviously a disadvantage when we doing the calculation of KDE since we need to find the most proper value of kernel bandwidth.
# However, the Normal distribution can easily be used when the sample size is large, then we conclude that the Gaussian method is much more suitable than KDE for numeric attributes.

Provide your text answer of 150-200 words in this cell.

### Q3 [4 marks]
In `train()`, you are asked to treat the missing value of nominal attributes as a new category. There is another option (as suggested in Thu lecture in week 2): <u>ignoring the missing values</u>. 
Compare the two methods in both large and small datasets. Comment and explain your observations.
You can extract the first 50 records to construct a small dataset.Use Gaussian Naive Bayes only for this question.

In [11]:
# Write additional code here, if necessary (you may insert additional code cells)

Provide your text answer of 150-200 words in this cell.

### Q4 [4 marks]
In week 4, we have learned how to obtain information gain (IG) and gain ratio (GR) to choose an attribute to split a node in a decision tree. We will see how to apply them in the Naive Bayes classification.

(a) Compute the GR of each attribute $X_i$, relative to the class distribution. In the Na\"ive Bayes classifier, remove attributes in the ascending order of GR: first, remove $P(X_i|c_j)$ such that $X_i$ has the least GR; second, remove $P(X_{i'}|c_j)$ such that $X_{i'}$ has the second least GR,......, until there is only one $X_{i*}$ with the largest GR remaining in the maximand $P(c_j) P(X_{i^*} | c_j)$. Observe the <u>change of the accuracy for both Gaussian and KDE</u> (Choose bandwidth $\sigma=10$ for KDE).

(b) Compute the IG between each pair of attributes. Describe and explain your observations. Choose an attribute and implement an estimator to predict the value of `education num`. Explain why you choose this attribute. Enumerate two other examples that an attribute can be used to estimate the other and explain the reason.  

In [12]:
# Write additional code here, if necessary (you may insert additional code cells)

### (a)

Provide your text answer to **Question 4.a** of 100-150 words in this cell.

### (b)

Provide your text answer to **Question 4.b** of 150-200 words in this cell.

<b>Authorship Declaration</b>:

   (1) I certify that the program contained in this submission is completely
   my own individual work, except where explicitly noted by comments that
   provide details otherwise.  I understand that work that has been developed
   by another student, or by me in collaboration with other students,
   or by non-students as a result of request, solicitation, or payment,
   may not be submitted for assessment in this subject.  I understand that
   submitting for assessment work developed by or in collaboration with
   other students or non-students constitutes Academic Misconduct, and
   may be penalized by mark deductions, or by other penalties determined
   via the University of Melbourne Academic Honesty Policy, as described
   at https://academicintegrity.unimelb.edu.au.

   (2) I also certify that I have not provided a copy of this work in either
   softcopy or hardcopy or any other form to any other student, and nor will
   I do so until after the marks are released. I understand that providing
   my work to other students, regardless of my intention or any undertakings
   made to me by that other student, is also Academic Misconduct.

   (3) I further understand that providing a copy of the assignment
   specification to any form of code authoring or assignment tutoring
   service, or drawing the attention of others to such services and code
   that may have been made available via such a service, may be regarded
   as Student General Misconduct (interfering with the teaching activities
   of the University and/or inciting others to commit Academic Misconduct).
   I understand that an allegation of Student General Misconduct may arise
   regardless of whether or not I personally make use of such solutions
   or sought benefit from such actions.

   <b>Signed by</b>: [Enter your full name and student number here before submission]
   
   <b>Dated</b>: [Enter the date that you "signed" the declaration]