### The University of Melbourne, School of Computing and Information Systems
# COMP30027 Machine Learning, 2022 Semester 1

## Assignment 1: Naive Bayes Learner for Adult Database


**Student Name(s):** `Noah Sebastian`
<br>
**Student ID(s):** `911150`



Marking will be applied on the four functions that are defined in this notebook, and to your responses to the questions at the end of this notebook.

## General info

<b>Due date</b>: Friday, 8 April 2022 7pm

<b>Submission method</b>: Canvas submission

<b>Submission materials</b>: This iPython notebook is a template which you will use for your Assignment 1 submission. You only need to submit the completed copy of this iPython notebook.

<b>Late submissions</b>: -10% per day up to 5 days (both weekdays and weekends count). Submissions more than 5 days late will not be accepted (resulting in a mark of 0).
<ul>
    <li>one day late, -1.0;</li>
    <li>two days late, -2.0;</li>
    <li>three days late, -3.0;</li>
    <li>four days late, -4.0;</li>
    <li>five days late, -5.0;</li>
</ul>

<b>Extensions</b>: Students who are demonstrably unable to submit a full solution in time due to medical reasons or other trauma may apply for an extension. In these cases, you should email <a href="mailto:ni.ding@unimelb.edu.au">Ni Ding</a> as soon as possible after those circumstances arise. If you attend a GP or other health care service as a result of illness, be sure to provide a Health Professional Report (HPR) form (get it from the Special Consideration section of the Student Portal); you will need this form to be filled out if your illness develops into something that later requires a Special Consideration application to be lodged. You should scan the HPR form and send it with the extension request.

<b>Marks</b>: This assignment will be marked out of 20, and make up 20% of your overall mark for this subject.

<b>Materials</b>: Use Jupyter Notebook and Python page on Canvas for information on the basic setup required for this class, including an iPython notebook viewer and the python packages NLTK, Numpy, Scipy, Matplotlib, Scikit-Learn. You can use any Python built-in packages, but do not use any other 3rd party packages; if your iPython notebook doesn't run on the marker's machine, you will lose marks. <b> You should use Python 3</b>.  


<b>Evaluation</b>: Your iPython notebook should run end-to-end without any errors in a reasonable amount of time, and you must follow all instructions provided below, including specific implementation requirements and instructions for what needs to be printed (please avoid printing output we don't ask for). You should edit the sections below where requested, but leave the rest of the code as is. You should leave the output from running your code in the iPython notebook you submit, to assist with marking. The amount each section is worth is given in parentheses after the instructions.

You will be marked not only on the correctness of your methods, but also the quality and efficiency of your code: in particular, you should be careful to use Python built-in functions and operators when appropriate and pick descriptive variable names that adhere to <a href="https://www.python.org/dev/peps/pep-0008/">Python style requirements</a>. If you think it might be unclear what you are doing, you should comment your code to help the marker make sense of it. We reserve the right to deduct up to 2 marks for unreadable or excessively inefficient code.

8 of the marks available for this Project will be assigned to whether the four specified Python functions work in a manner consistent with the materials from COMP30027. Any other implementation will not be directly assessed (except insofar as it is required to make these four functions work correctly).

12 of the marks will be assigned to your responses to the questions, in terms of both accuracy and insightfulness. We will be looking for evidence that you have an implementation that allows you to explore the problem, but also that you have thought deeply about the data and the behaviour of the Naive Bayes classifier.

<b>Updates</b>: Any major changes to the assignment will be announced via Canvas. Minor changes and clarifications will be announced on the discussion board (ED -> Assignments -> A1); we recommend you check it regularly.

<b>Academic misconduct</b>: For most people, collaboration will form a natural part of the undertaking of this homework, and we encourage you to discuss it in general terms with other students. However, this ultimately is still an individual task, and so reuse of code or other instances of clear influence will be considered cheating. Please check the <a href="https://canvas.lms.unimelb.edu.au/courses/124196/modules#module_662096">CIS Academic Honesty training</a> for more information. We will be checking submissions for originality and will invoke the University’s <a href="http://academichonesty.unimelb.edu.au/policy.html">Academic Misconduct policy</a> where inappropriate levels of collusion or plagiarism are deemed to have taken place.

**IMPORTANT**

Please carefully read and fill out the <b>Authorship Declaration</b> form at the bottom of the page. Failure to fill out this form results in the following deductions: 
<UL TYPE="square">
<LI>missing Authorship Declaration at the bottom of the page, -5.0
<LI>incomplete or unsigned Authorship Declaration at the bottom of the page, -3.0
</UL>
**NOTE: COMPLETE AND SUBMIT THIS FILE. YOU SHOULD IMPLEMENT FOUR FUNCTIONS AND INCLUDE YOUR ANSWERS TO THE QUESTIONS IN THIS FILE ONLY. NO OTHER SUBMISSION IS REQUIRED.**

**Keep your code clean. Adding proper comments to your code is MANDATORY.**

## Part 1: Base code [8 marks]

Instructions
1. Do **not** shuffle the data set
2. Treat the attributes as they are (e.g., do **not** convert numeric attributes to categorical or categorical to numeric). Implement a Naive Bayes classifier with an appropriate likelihood function for each attribute.
3. You should implement the Naive Bayes classifier from scratch. Do **not** use existing implementations/learning algorithms.
4. You CANNOT have more than one train or predict function. Both continuous numeric attributes and categorical ones should be trained in one `train()` function, similarly for the `predict()`.  
5. Apart from the instructions in point 3, you may use libraries to help you with data reading, representation, maths or evaluation
6. Ensure that all and only required information is printed, as indicated in the final three code cells. Failure to print the required information will result in **[-1 mark]** per case. *(We don't mind details like whether you print a list or several numbers -- just make sure the information is displayed so that it's easily accessible)
7. You may change the prototypes of these functions, and you may write other functions, according to your requirements. We would appreciate it if the required functions were prominent/easy to find. 
8. You should add adequate comments to make your code easily comprehendible.*

In [28]:
# This function should prepare the data by reading it from a file and converting it into a useful format for training and testing
# and implement 90-10 splitting as specified in the project description.

def preprocess(filename):
    _split_param = 0.1
    try:
        data = pd.read_csv(filename, skipinitialspace=True)
        data = data.replace('?', np.nan)
        impute = SimpleImputer(strategy='mean')

        # impute missing values for quantitative data with the column mean
        for col in numerical:
            data[col] = impute.fit_transform(data[[col]]).ravel()
        # impute missing values for qualitative data with the most frequent category
        for col in nominal:
            data[col] = data[col].fillna(data[col].value_counts().index[0])
        # this surprisingly had no effect on performance

        X = data.iloc[:, :-1]
        y = data.iloc[:, -1]
        # split data into training and test
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=_split_param,
                                                            random_state=0,
                                                            shuffle=False)

    except FileNotFoundError:
        # pd.read_csv raises FileNotFoundError (not FileExistsError) when the path is missing
        print(f"Unable to find file '{filename}'")
        return None

    return X_train, y_train, X_test, y_test, data

In [29]:
# This function should calculate prior probabilities and likelihoods (conditional probabilities) from the training data and use
# them to build a naive Bayes model

def train(X, y):
    model = {}

    for col in X.columns.values:
        if col in nominal:
            priors = pd.crosstab(X[col], y, margins=True, normalize="index")
            conditionals = pd.crosstab(X[col], y, margins=True, normalize="columns").replace(to_replace=0,value=epsilon_0)
            model[col] = Probs(priors, conditionals)
        elif col in numerical:
            # store the per-class mean and standard deviation to use in the Gaussian distribution
            # priors = means, conditionals = standard deviations
            current_col = X[col]
            model[col] = GDParams((current_col[y == "<=50K"].mean(), current_col[y == ">50K"].mean()),
                                  (current_col[y == "<=50K"].std(), current_col[y == ">50K"].std()))
    return model

In [30]:
# This function should predict classes for new items in the testing data
def predict(model, X):
    priors = model['work class'].priors.loc['All']  # could use any column, they will all give the same priors for c_i
    p_c1 = priors['<=50K']
    p_c2 = priors['>50K']

    labels = []
    c_1_log_probs = []
    c_2_log_probs = []

    for index, row in X.iterrows():
        probs_c1 = []
        probs_c2 = []

        for col_name in X.columns.values:
            # current value we are finding conditional probability for
            current = row[col_name]
            if col_name in nominal:
                # nominal calculations here
                current_col = model[col_name].conditionals
                # check if row exists first:
                if current in current_col.index:
                    cond_prob = current_col.loc[current]
                    xic1 = cond_prob['<=50K']
                    xic2 = cond_prob['>50K']
                else:
                    xic1 = epsilon_0
                    xic2 = epsilon_0

                probs_c1.append(xic1)
                probs_c2.append(xic2)

            elif col_name in numerical:
                # numerical calculations here
                probs_c1.append(get_gaussian_probs(current,
                                                   model[col_name].priors[0],
                                                   model[col_name].conditionals[0]))
                probs_c2.append(get_gaussian_probs(current,
                                                   model[col_name].priors[1],
                                                   model[col_name].conditionals[1]))

        probs_c1.append(p_c1)
        probs_c2.append(p_c2)

        condition_prob_c1 = np.sum(np.log(probs_c1))
        condition_prob_c2 = np.sum(np.log(probs_c2))

        c_1_log_probs.append(condition_prob_c1)
        c_2_log_probs.append(condition_prob_c2)

        label = np.argmax([condition_prob_c1, condition_prob_c2])
        if label == 0:
            labels.append("<=50K")
        else:
            labels.append(">50K")

    # Return log probs in a nice dataframe
    log_probs_raw = {'<=50K': c_1_log_probs, '>50K': c_2_log_probs, 'Labels': labels}
    log_probs = pd.DataFrame(log_probs_raw)

    return labels, log_probs


In [31]:
# This function should evaluate the prediction performance by comparing your model’s class outputs to ground
# truth labels, return and output accuracy, confusion matrix and F1 score.

def evaluate(actual, predicted):
    # Calculate accuracy score
    acc_score = accuracy_score(actual.values, predicted)

    # Calculate confusion matrix
    conf_matrix = confusion_matrix(actual.values, predicted)

    # calculate f1 score, treating '<=50K' (row/column 0) as the positive class;
    # with that convention ravel() yields tp, fn, fp, tn in this order
    tp, fn, fp, tn = conf_matrix.ravel()
    fone_score = tp / (tp + 0.5 * (fp + fn))

    return acc_score, conf_matrix, fone_score
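
The F1 expression used in `evaluate()` is algebraically the harmonic mean of precision and recall. A minimal sketch checking this on the confusion matrix reported for the 90-10 split, `[[69, 8], [6, 17]]`, assuming '<=50K' is the positive class (the variable names are illustrative only):

```python
import math

# Counts read from the confusion matrix [[69, 8], [6, 17]],
# taking '<=50K' (row/column 0) as the positive class.
tp, fn, fp, tn = 69, 8, 6, 17

precision = tp / (tp + fp)
recall = tp / (tp + fn)

# Harmonic-mean form of F1 ...
f1_harmonic = 2 * precision * recall / (precision + recall)
# ... and the count-based form used in evaluate()
f1_counts = tp / (tp + 0.5 * (fp + fn))

print(f1_harmonic, f1_counts)
```

Both forms agree up to floating-point rounding, so either can be used as a cross-check on the other.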



In [32]:
# Extra functions
def Probs(priors, conditionals):
    return _Probs(priors, conditionals)

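def GDParams(mean, sd):
    # reuse the _Probs tuple so predict() can read both attribute kinds uniformly:
    # .priors holds the per-class means, .conditionals the per-class standard deviations
    return _Probs(mean, sd)

def get_gaussian_probs(x, mean, sd):
    # the third argument of stats.norm.pdf is the scale, i.e. the standard deviation
    return stats.norm.pdf(x, mean, sd)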
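
For reference, the normal density that `stats.norm.pdf` evaluates, given a mean and a standard deviation, can be written out by hand. The sketch below uses only the standard library; `gaussian_pdf` is a hypothetical helper, not part of the submission:

```python
import math

def gaussian_pdf(x, mean, sd):
    """Normal density at x, parameterised by mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

# At the mean, the density is 1 / (sd * sqrt(2 * pi)); doubling sd halves the peak.
peak_sd1 = gaussian_pdf(0.0, 0.0, 1.0)
peak_sd2 = gaussian_pdf(0.0, 0.0, 2.0)
print(peak_sd1, peak_sd2)
```

Passing a variance where a standard deviation is expected would silently change the width of the curve, so the distinction is worth keeping explicit in parameter names.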


In [33]:
# This cell should act as your "main" function where you call the above functions 
# on the full ADULT data set, and print the evaluation results. [0.33 marks]

import collections
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from scipy import stats
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.impute import SimpleImputer
from sklearn.model_selection import KFold
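# NOTE: epsilon_0 is the vacuum permittivity (~8.85e-12); it is used here only as a
# small positive constant to avoid log(0), not as a machine epsilon
from scipy.constants import epsilon_0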

# constants/named tuples for readability
nominal = ['work class', 'education', 'marital status', 'occupation', 'relationship', 'race', 'sex', 'native country (region)']
numerical = ['age', 'hours per week', 'education num']
_Probs = collections.namedtuple("Probs", "priors conditionals")
_GDParams = collections.namedtuple("GDParams", "mean sd")




# First, read in the data and apply your NB model to the ADULT data
X_train, y_train, X_test, y_test, processed_data = preprocess( "C:/Users/noahs/OneDrive/Desktop/.uni/COMP30027_Machine "
        "Learning/.assignments/.A1/adult.csv")
model = train(X_train, y_train)
predicted_responses, log_probs = predict(model, X_test)


# Second, print the full evaluation results from the evaluate() function
acc_score, conf_matrix, fone_score = evaluate(y_test, predicted_responses)
print("Accuracy Score:\n", acc_score)
print("Confusion Matrix:\n", conf_matrix)
print("F1 score:", fone_score)


# Third, print data statistics and model predictions, as instructed below 
# N is the total number of instances, F the total number of attributes, L the total number of labels
# The "class probabilities" may be unnormalized
# The "predicted class ID" must be in range (0, L)
N = len(predicted_responses)
F = len(model.keys())
L = len(np.unique(processed_data["label"]))

print("\nAttribute vectors of instances [0, 1, 2]: \n", processed_data.iloc[:3 ,:-1]) # of the first three records in adult.csv

print("\nNumber of instances (N): ", N)
print("Number of attributes (F): ", F)
print("Number of labels (L): ", L)



#print out the prediction results of the last three instances
print("\n\nPredicted class log-probabilities for instance N-3: \n", log_probs.iloc[N - 3, :-1])
print("Predicted class ID for instance N-3: \n", log_probs.iloc[N - 3, -1])
print("\nPredicted class log-probabilities for instance N-2: \n", log_probs.iloc[N - 2, :-1])
print("Predicted class ID for instance N-2: \n", log_probs.iloc[N - 2, -1])
print("\nPredicted class log-probabilities for instance N-1: \n", log_probs.iloc[N - 1, :-1])
print("Predicted class ID for instance N-1: \n", log_probs.iloc[N - 1, -1])




Accuracy Score:
 0.86
Confusion Matrix:
 [[69  8]
 [ 6 17]]
F1 score: 0.9078947368421053

Attribute vectors of instances [0, 1, 2]: 
     age        work class  education  education num      marital status  \
0  68.0           Private    1st-4th            2.0            Divorced   
1  39.0         State-gov  Bachelors           13.0       Never-married   
2  50.0  Self-emp-not-inc  Bachelors           13.0  Married-civ-spouse   

        occupation   relationship   race     sex  hours per week  \
0     Craft-repair  Not-in-family  White  Female            20.0   
1     Adm-clerical  Not-in-family  White    Male            40.0   
2  Exec-managerial        Husband  White    Male            13.0   

  native country (region)  
0           United-States  
1           United-States  
2           United-States  

Number of instances (N):  100
Number of attributes (F):  11
Number of labels (L):  2


Predicted class log-probabilities for instance N-3: 
 <=50K   -20.595888
>50K    -19.489186
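
The two values printed for instance N-3 are unnormalised log scores. If a normalised class probability is wanted, one option (an illustration, not something the assignment requires) is to subtract the maximum score before exponentiating:

```python
import math

# Log scores printed above for instance N-3
log_scores = {"<=50K": -20.595888, ">50K": -19.489186}

# Subtracting the largest score keeps exp() well within floating-point range
max_score = max(log_scores.values())
unnormalised = {label: math.exp(score - max_score) for label, score in log_scores.items()}
total = sum(unnormalised.values())
posteriors = {label: value / total for label, value in unnormalised.items()}

# The predicted label is simply the one with the largest log score
predicted_label = max(log_scores, key=log_scores.get)
print(predicted_label, posteriors)
```

Normalising does not change which class wins; it only makes the margin between the two classes easier to read.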


## Part 2: Conceptual questions [8 marks for groups of 1] / [16 marks for groups of 2]


If you are in a group of 1, you should respond to Q1 and Q2.

If you are in a group of 2, you should respond to Q1, Q2, Q3 and Q4.

A response to a question should take about 100–250 words. You may need to develop code or functions to help respond to the question here. 

#### NOTE: We strongly recommend <u>including figures or tables, etc.</u> to support your responses. The figures and tables inserted in Markdown cells must be reproducible by your code.

### Q1 [4 marks]
<u>Sensitivity</u> and <u>specificity</u> are two model evaluation metrics.  A good model should have both sensitivity and specificity high. Use the $2 \times 2$ confusion matrix returned by `evaluate()` to calculate the sensitivity and specificity. Do you see a difference between them? If so, what causes this difference? Provide suggestions to improve the model performance. 

The sensitivity of the model: tp/(tp + fn) = 69/(69+8) ≈ 0.896

The specificity of the model: tn/(tn + fp) = 17/(17+6) ≈ 0.739

`Letting '<=50K','>50K' be positive and negative respectively`, so that the confusion matrix gives tp = 69, fn = 8, fp = 6 and tn = 17.

There is a clear difference between the sensitivity and the specificity of the model.

The sensitivity is high (0.896): the model is good at classifying test instances whose label is '<=50K'. By comparison, it fares noticeably worse (0.739) at identifying true negatives ('>50K').

This gap could potentially be attributed to imbalanced class labels in the data.
Looking at the distribution of the class labels in the dataset, and subsequently in the training and test sets, there is a dominant class, '<=50K': ~76.9% of instances in adult.csv carry this label, with only ~23.1% labelled '>50K'.

This imbalance gives '<=50K' a much larger prior, which could induce the model to over-label test instances as that class, at the expense of specificity.
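
Sensitivity and specificity can be computed directly from the confusion matrix. A small sketch, assuming '<=50K' is the positive class and that rows are actual labels and columns are predicted labels (the layout `sklearn.metrics.confusion_matrix` uses):

```python
# Confusion matrix returned by evaluate() on the 90-10 split
conf_matrix = [[69, 8],
               [6, 17]]

# Row 0 / column 0 correspond to '<=50K', treated here as the positive class
(tp, fn), (fp, tn) = conf_matrix

sensitivity = tp / (tp + fn)  # share of actual '<=50K' instances predicted correctly
specificity = tn / (tn + fp)  # share of actual '>50K' instances predicted correctly

print(f"Sensitivity: {sensitivity:.3f}")
print(f"Specificity: {specificity:.3f}")
```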


### Q2 [4 marks]
You can adopt different methods for training and/or testing, which will produce different results in model evaluation. 

(a) Instead of Gaussian, <u>implement KDE</u> for  $P(X_i|c_j)$ for numeric attributes $X_i$. Compare the evaluation results with Gaussian. Which one do you think is more suitable to model $P(X_i|c_j)$, Gaussian or KDE? Observe all numeric attributes and justify your answer.

You can choose an arbitrary value for kernel bandwidth $\sigma$ for KDE, but a value between 3 and 15 is recommended. You should write code to implement KDE, not call an existing function/method such as `KernelDensity` from `scikit-learn`.

(b) Implement <u>10-fold and 2-fold cross-validations</u>.  
	Observe the evaluation results in each fold and the average accuracy, recall and specificity over all folds. 
	Comment on the effect of changing the value of $m$ in $m$-fold cross-validation. (You can choose either Gaussian or KDE Naive Bayes.)
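
For part (a), a Gaussian kernel density estimate of the class-conditional density can be written as follows, where $x_1, \dots, x_{N_j}$ denote the training values of attribute $X_i$ among the $N_j$ instances of class $c_j$ (this notation is introduced here for illustration and is not part of the assignment text):

```latex
$$
\hat{P}(X_i = x \mid c_j)
  = \frac{1}{N_j} \sum_{k=1}^{N_j}
    \frac{1}{\sigma \sqrt{2\pi}}
    \exp\!\left( -\frac{(x - x_k)^2}{2\sigma^2} \right)
$$
```

Each training value contributes one Gaussian bump of width $\sigma$ centred on itself, so a larger bandwidth gives a smoother estimate.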

In [34]:
# KDE functions

def get_kde(x, sigma, col):
    # Gaussian kernel density estimate at x: the average of N(x - x_i; 0, sigma)
    # over all training values x_i in col (a 1-D array of one class's values)
    kernel_values = get_gaussian_probs(x - np.asarray(col), 0, sigma)
    return np.mean(kernel_values)

def predict_kde(model, X, X_train, y_train, sigma):
    priors = model['work class'].priors.loc['All']  # could use any column, they will all give the same priors for c_i
    p_c1 = priors['<=50K']
    p_c2 = priors['>50K']

    labels = []
    c_1_log_probs = []
    c_2_log_probs = []

    for index, row in X.iterrows():
        probs_c1 = []
        probs_c2 = []

        for col_name in X.columns.values:
            # current value we are finding conditional probability for
            current = row[col_name]
            if col_name in nominal:
                # nominal calculations here
                current_col = model[col_name].conditionals
                if current in current_col.index:
                    cond_prob = current_col.loc[current]
                    xic1 = cond_prob['<=50K']
                    xic2 = cond_prob['>50K']
                else:
                    xic1 = epsilon_0
                    xic2 = epsilon_0

                probs_c1.append(xic1)
                probs_c2.append(xic2)

            elif col_name in numerical:
                # numerical calculations here
                probs_c1.append(get_kde(row[col_name], sigma, X_train[y_train == '<=50K'].loc[:, col_name].values))
                probs_c2.append(get_kde(row[col_name], sigma, X_train[y_train == '>50K'].loc[:, col_name].values))

        probs_c1.append(p_c1)
        probs_c2.append(p_c2)

        p11 = np.sum(np.log(probs_c1))
        p22 = np.sum(np.log(probs_c2))

        c_1_log_probs.append(p11)
        c_2_log_probs.append(p22)

        label = np.argmax([p11, p22])
        if label == 0:
            labels.append("<=50K")
        else:
            labels.append(">50K")

    # Return log probs in a nice dataframe
    log_probs_raw = {'<=50K': c_1_log_probs, '>50K': c_2_log_probs, 'Labels': labels}
    log_probs = pd.DataFrame(log_probs_raw)

    return labels, log_probs

def q2a(model, X_test, X_train, y_train, y_test, sigma):
    predicted_responses_kde, log_probs_kde = predict_kde(model, X_test, X_train, y_train, sigma)
    acc_score_kde, conf_matrix_kde, fone_score_kde = evaluate(y_test, predicted_responses_kde)
    print("Accuracy Score:\n", acc_score_kde)
    print("Confusion Matrix:\n", conf_matrix_kde)
    print("F1 score:", fone_score_kde)
    return

def cross_val(processed_data, k):
    kf = KFold(n_splits=k, random_state=None)
    acc = []
    rec = []
    spe = []
    for train_index, test_index in kf.split(processed_data):
        X_train, X_test = processed_data.iloc[train_index, :-1], processed_data.iloc[test_index, :-1]
        y_train, y_test = processed_data.iloc[train_index, -1], processed_data.iloc[test_index, -1]

        model_i = train(X_train, y_train)
        pred_i, log_probs_i = predict(model_i, X_test)
        a, r, s = get_q2b_scores(y_test, pred_i)
        acc.append(a)
        rec.append(r)
        spe.append(s)

        # Print out evaluation function results
        acc_score, conf_matrix, fone_score = evaluate(y_test, pred_i)
        print("Accuracy Score:\n", acc_score)
        print("Confusion Matrix:\n", conf_matrix)
        print("F1 score:", fone_score)

    return [np.mean(acc), np.mean(rec), np.mean(spe)]


def get_q2b_scores(actual, predicted):
    # Calculate accuracy score
    acc_score = accuracy_score(actual.values, predicted)

    # get vals from conf. matrix and calc. specificity and recall score
    tp, fn, fp, tn = confusion_matrix(actual.values, predicted).ravel()
    specificity = tn / (tn + fp)
    rec_score = tp / (tp + fn)

    return acc_score, rec_score, specificity


In [36]:
# Question 2a
print("Model scores using KDE: \n")
q2a(model, X_test, X_train, y_train, y_test, sigma=11) # give this a minute :L


print("\nCross_fold_validation results (Gaussian): \n")
# Question 2b
k = 10
cval_res = cross_val(processed_data, k)
print("\n{} fold cross validation results:\nAccuracy: {}, Recall: {}, Specificity: {}".format(k,
                                                                                            cval_res[0],
                                                                                            cval_res[1],
                                                                                            cval_res[2]))

k = 2
cval_res = cross_val(processed_data, k)
print("{} fold cross validation results:\nAccuracy: {}, Recall: {}, Specificity: {}".format(k,
                                                                                            cval_res[0],
                                                                                            cval_res[1],
                                                                                            cval_res[2]))




Model scores using KDE: 

Accuracy Score:
 0.84
Confusion Matrix:
 [[66 11]
 [ 5 18]]
F1 score: 0.8918918918918919

Cross_fold_validation results: 

Accuracy Score:
 0.79
Confusion Matrix:
 [[59 16]
 [ 5 20]]
F1 score: 0.8489208633093526
Accuracy Score:
 0.81
Confusion Matrix:
 [[62 16]
 [ 3 19]]
F1 score: 0.8671328671328671
Accuracy Score:
 0.82
Confusion Matrix:
 [[66 11]
 [ 7 16]]
F1 score: 0.88
Accuracy Score:
 0.86
Confusion Matrix:
 [[68  9]
 [ 5 18]]
F1 score: 0.9066666666666666
Accuracy Score:
 0.86
Confusion Matrix:
 [[73  7]
 [ 7 13]]
F1 score: 0.9125
Accuracy Score:
 0.82
Confusion Matrix:
 [[66 13]
 [ 5 16]]
F1 score: 0.88
Accuracy Score:
 0.8
Confusion Matrix:
 [[66 11]
 [ 9 14]]
F1 score: 0.868421052631579
Accuracy Score:
 0.73
Confusion Matrix:
 [[61 15]
 [12 12]]
F1 score: 0.8187919463087249
Accuracy Score:
 0.78
Confusion Matrix:
 [[63 10]
 [12 15]]
F1 score: 0.8513513513513513
Accuracy Score:
 0.86
Confusion Matrix:
 [[69  8]
 [ 6 17]]
F1 score: 0.9078947368421053

10

Provide your text answer of 150-200 words in this cell.


(a) Both models performed reasonably well on the accuracy metric. Gaussian (0.86) just outperformed KDE (0.84), suggesting relatively similar performance.

Further, when looking at the F1 score, Gaussian (0.908) and KDE (0.892) were almost equal.

The main points of difference between the two models were the TP and FN counts. The Gaussian model had fewer FN (8) than KDE (11), and consequently more TP (69) than KDE (66).

Based on these results, I would suggest that the Gaussian estimation is a better fit for this dataset. As a caveat, I would say that without proper domain knowledge, and further understanding behind the intentions of the model, it is difficult to know which estimation technique is correct, although in our example, I would lean towards Gaussian.

(b) There is a trade-off associated with changing m. A small value of m (e.g. 2) can lead to high variance in the results, and the model may generate values that are unrepresentative of the population, while higher values of m (e.g. >10) are computationally expensive and have longer run times. It is therefore ideal to find a value that balances the two.
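
The trade-off is easier to see from the fold sizes. The sketch below assumes 1000 instances (ten test folds of 100 each, as in the 10-fold output above) and contiguous, unshuffled folds; `fold_sizes` is a hypothetical helper written for illustration:

```python
def fold_sizes(n_samples, n_folds):
    """Test-fold sizes for contiguous m-fold splitting without shuffling."""
    base, remainder = divmod(n_samples, n_folds)
    # The first `remainder` folds take one extra instance each
    return [base + 1 if i < remainder else base for i in range(n_folds)]

n_samples = 1000  # assumed dataset size
for m in (2, 10):
    test_size = fold_sizes(n_samples, m)[0]
    train_size = n_samples - test_size
    print(f"m={m}: {m} models trained, "
          f"{train_size} training / {test_size} test instances per fold")
```

With m = 2 each model sees only half of the data, while m = 10 trains on 90% of it at the cost of fitting five times as many models.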


### Q3 [4 marks]
In `train()`, you are asked to treat the missing value of nominal attributes as a new category. There is another option (as suggested in Thu lecture in week 2): <u>ignoring the missing values</u>. 
Compare the two methods in both large and small datasets. Comment and explain your observations.
You can extract the first 50 records to construct a small dataset. Use Gaussian Naive Bayes only for this question.

In [6]:
# Write additional code here, if necessary (you may insert additional code cells)

Provide your text answer of 150-200 words in this cell.

### Q4 [4 marks]
In week 4, we have learned how to obtain information gain (IG) and gain ratio (GR) to choose an attribute to split a node in a decision tree. We will see how to apply them in the Naive Bayes classification.

(a) Compute the GR of each attribute $X_i$, relative to the class distribution. In the Naive Bayes classifier, remove attributes in ascending order of GR: first, remove $P(X_i|c_j)$ such that $X_i$ has the least GR; second, remove $P(X_{i'}|c_j)$ such that $X_{i'}$ has the second least GR; and so on, until there is only one $X_{i^*}$ with the largest GR remaining in the maximand $P(c_j) P(X_{i^*} | c_j)$. Observe the <u>change of the accuracy for both Gaussian and KDE</u> (choose bandwidth $\sigma=10$ for KDE).

(b) Compute the IG between each pair of attributes. Describe and explain your observations. Choose an attribute and implement an estimator to predict the value of `education num`. Explain why you chose this attribute. Give two other examples in which one attribute can be used to estimate another, and explain the reason.  

In [None]:
# Write additional code here, if necessary (you may insert additional code cells)

### (a)

Provide your text answer to **Question 4.a** of 100-150 words in this cell.

### (b)

Provide your text answer to **Question 4.b** of 150-200 words in this cell.

<b>Authorship Declaration</b>:

   (1) I certify that the program contained in this submission is completely
   my own individual work, except where explicitly noted by comments that
   provide details otherwise.  I understand that work that has been developed
   by another student, or by me in collaboration with other students,
   or by non-students as a result of request, solicitation, or payment,
   may not be submitted for assessment in this subject.  I understand that
   submitting for assessment work developed by or in collaboration with
   other students or non-students constitutes Academic Misconduct, and
   may be penalized by mark deductions, or by other penalties determined
   via the University of Melbourne Academic Honesty Policy, as described
   at https://academicintegrity.unimelb.edu.au.

   (2) I also certify that I have not provided a copy of this work in either
   softcopy or hardcopy or any other form to any other student, and nor will
   I do so until after the marks are released. I understand that providing
   my work to other students, regardless of my intention or any undertakings
   made to me by that other student, is also Academic Misconduct.

   (3) I further understand that providing a copy of the assignment
   specification to any form of code authoring or assignment tutoring
   service, or drawing the attention of others to such services and code
   that may have been made available via such a service, may be regarded
   as Student General Misconduct (interfering with the teaching activities
   of the University and/or inciting others to commit Academic Misconduct).
   I understand that an allegation of Student General Misconduct may arise
   regardless of whether or not I personally make use of such solutions
   or sought benefit from such actions.

   <b>Signed by</b>: Noah Sebastian 911150
   
   <b>Dated</b>: 8/04/2022
  