# Questions on Naïve Bayes and k-Nearest Neighbors (kNN)

## Question 1

Assume that we want to use a Naïve Bayes classifier on a binary
classification task, with the class labels being $c1$ and $c2$ and
involving the binary features $f1$ and $f2$. Moreover, assume a
uniform class prior, i.e., $P(c1) = P(c2)$, and that the class
conditional probabilities include $P(f1=0|c1) = 0$ and $P(f2=0|c2) = 1$.

What class label $c \in \{c1,c2\}$ maximizes $P(c|f1=0 \& f2=1)$?

A: c1 will get a higher probability than c2

B: c2 will get a higher probability than c1

C: c1 and c2 will get the same probabilities

Correct answer: C

Explanation: When calculating $P(c1|f1=0 \& f2=1)$ using Naïve Bayes, the class prior $P(c1)$ is multiplied with $P(f1=0|c1)$ and $P(f2=1|c1)$, and divided by $P(f1=0 \& f2=1)$. Since $P(f1=0|c1) = 0$, it means that according to Naïve Bayes, $P(c1|f1=0 \& f2=1) = 0$.

When calculating $P(c2|f1=0 \& f2=1)$ using Naïve Bayes, the class prior $P(c2)$ is multiplied with $P(f1=0|c2)$ and $P(f2=1|c2)$, and divided by $P(f1=0 \& f2=1)$. Since $P(f2=1|c2) = 1 - P(f2=0|c2) = 0$, it means that according to Naïve Bayes, $P(c2|f1=0 \& f2=1) = 0$.

In other words, Naïve Bayes would assign the same score (zero) to both class labels, so neither label
is preferred over the other and both labels maximize the expression.
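
The two scores can be checked numerically. The following is a minimal sketch: the values used for $P(f1=0|c2)$ and $P(f2=1|c1)$ are not given in the question and are arbitrary assumptions, as is the helper name `nb_score`.

```python
# Unnormalized Naive Bayes scores P(c) * P(f1=0|c) * P(f2=1|c) for Question 1.
prior = {"c1": 0.5, "c2": 0.5}    # uniform class prior
p_f1_0 = {"c1": 0.0, "c2": 0.3}   # P(f1=0|c); the value for c2 is an arbitrary assumption
p_f2_1 = {"c1": 0.7, "c2": 0.0}   # P(f2=1|c); c1 is arbitrary, c2 = 1 - P(f2=0|c2) = 0

def nb_score(c):
    """Product of the class prior and the two class-conditional probabilities."""
    return prior[c] * p_f1_0[c] * p_f2_1[c]

scores = {c: nb_score(c) for c in prior}
print(scores)
```

Whatever values are chosen for the two unspecified probabilities, each product contains a zero factor, so both scores are zero. The normalizing sum is then zero as well, which is why the comparison is made on the unnormalized scores.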

## Question 2

Assume that we are facing a binary classification task, where a positive class label ($+$) is observed when the binary features $f_1$ and $f_2$ have the same value (both 0 or both 1), and a negative label ($-$) is observed in all other cases, i.e., when $f_1 \neq f_2$.

Can Naïve Bayes be expected to learn an accurate model for this task?

A: Yes

B: No

C: Maybe

Correct answer: B

Explanation: 

From the description of the classification task, it follows that $P(+|f_1=0 \And f_2=0)$ and $P(+|f_1=1 \And f_2=1)$ should be high (ideally 1), and $P(-|f_1=0 \And f_2=0)$ and $P(-|f_1=1 \And f_2=1)$ should be low (ideally 0), while $P(+|f_1=1 \And f_2=0)$ and $P(+|f_1=0 \And f_2=1)$ should be low, and $P(-|f_1=1 \And f_2=0)$ and $P(-|f_1=0 \And f_2=1)$ should be high.

According to Bayes' theorem, the above can be calculated from the class priors, $P(+)$ and $P(-)$, and conditional probabilities, e.g., $P(f_1=0 \And f_2=0|+)$. 

In Naïve Bayes, the conditional probabilities are broken up, e.g., $P(f_1=0 \And f_2=0|+)$ is assumed to be equal to the product of $P(f_1=0|+)$ and $P(f_2=0|+)$, i.e., the two events ($f_1=0$ and $f_2=0$) are assumed to be independent given the class ($+$). This assumption is clearly violated here: if one of the events is known, e.g., $f_1=0$, the probability of the other event changes, e.g., given the class $+$ and $f_1=0$, the probability of $f_2=0$ is 1.

If we assume that all combinations of feature values are equally likely, and label a (complete) training set according to the description, both class priors will be the same (0.5), as will all conditional probabilities employed by Naïve Bayes, i.e., $P(f_1=1|+) = P(f_1=0|+) = P(f_2=1|+) = P(f_2=0|+) = 0.5$ and $P(f_1=1|-) = P(f_1=0|-) = P(f_2=1|-) = P(f_2=0|-) = 0.5$. This means that Naïve Bayes will output the same class probabilities regardless of which instance is being classified, and hence cannot discriminate between the two classes.
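
The estimates above can be reproduced with a small sketch. It assumes a training set with exactly one instance per feature combination and relative-frequency estimates without smoothing; the helper names `cond_prob` and `posterior` are made up for this illustration.

```python
from collections import Counter
from itertools import product

# Complete training set: each combination of (f1, f2) once, labeled "+" when f1 == f2.
data = [((f1, f2), "+" if f1 == f2 else "-") for f1, f2 in product([0, 1], repeat=2)]

class_counts = Counter(label for _, label in data)
priors = {c: n / len(data) for c, n in class_counts.items()}

def cond_prob(feature_index, value, c):
    """Relative-frequency estimate of P(f_i = value | c), without smoothing."""
    matches = sum(1 for x, label in data if label == c and x[feature_index] == value)
    return matches / class_counts[c]

def posterior(x):
    """Normalized Naive Bayes class probabilities for instance x."""
    scores = {c: priors[c] * cond_prob(0, x[0], c) * cond_prob(1, x[1], c) for c in priors}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

posteriors = {x: posterior(x) for x, _ in data}
for x, p in posteriors.items():
    print(x, p)
```

Every instance receives a probability of 0.5 for each class, so the model carries no information about the label.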


## Question 3

Assume that a large number of binary features are added to a dataset
with two class labels c1 and c2, such that for each added feature f,
the class conditional probability $P(f=0|c1) = P(f=0|c2)$. What
potential effect will the addition of such features have on the
accuracy of Naïve Bayes and kNN respectively?

A: This will not have any effect on any of the algorithms

B: This will have an effect on Naïve Bayes only

C: This will have an effect on k-Nearest Neighbors only

D: This will have an effect on both algorithms

Correct answer: C

Explanation: When looking for the class label $c$ that maximizes $P(c|f1=v1, \ldots, fn=vn)$ for the features $f1, \ldots, fn$ and values $v1, \ldots, vn$, using Naïve Bayes, the class prior $P(c)$ is multiplied with $\prod_{i=1}^{n}P(fi=vi|c)$.

Hence, all features for which $P(f=0|c1) = P(f=0|c2)$ (and hence $P(f=1|c1) = P(f=1|c2)$, since the features are binary and the probabilities sum to one) will have no effect on the predicted class probabilities, and hence the accuracy will not be affected.

However, for kNN the added features will be taken into account in the distance calculations, possibly changing the neighborhood of the test instances and hence the predictions, which may have a (positive or negative) effect on the accuracy.
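
Both effects can be illustrated with a small sketch. All numbers below are hypothetical and chosen only for illustration, and `normalize` and `nearest` are helper names made up for this example.

```python
import math

def normalize(scores):
    """Turn unnormalized class scores into probabilities."""
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

# Naive Bayes: hypothetical scores P(c) * P(f_informative = v | c) for the two classes.
base = {"c1": 0.5 * 0.8, "c2": 0.5 * 0.25}
# An added feature with P(f=v|c1) = P(f=v|c2) = 0.5 scales both scores equally.
with_extra = {c: s * 0.5 for c, s in base.items()}

def nearest(x_test, X_train):
    """Index of the training instance closest to x_test (Euclidean distance)."""
    return min(range(len(X_train)), key=lambda i: math.dist(x_test, X_train[i]))

# kNN with one feature: instance 0 is the nearest neighbor of the test instance.
X_small = [(0.0,), (1.0,)]
x_small = (0.4,)

# The same instances with three added binary features (hypothetical values).
X_big = [(0.0, 1, 1, 1), (1.0, 0, 0, 0)]
x_big = (0.4, 0, 0, 0)

print(normalize(base), normalize(with_extra))
print(nearest(x_small, X_small), nearest(x_big, X_big))
```

The normalized Naïve Bayes probabilities are the same with and without the extra feature, because the common factor cancels. For kNN, the nearest neighbor changes from instance 0 to instance 1 once the added features enter the distance.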

## Question 4

Assume that for one feature there is a large number of missing values, while the non-missing values are all identical. May the k-nearest neighbor algorithm using the Euclidean distance be affected if we choose to remove the feature completely instead of imputing values?

A: Yes, and this holds independently of how we choose to do the imputation

B: No, we will always get the same results

C: The results will be the same if we choose to impute missing values with the mean of the non-missing values

D: The previous (C) would hold even if the non-missing values are not identical

Correct answer: C

Explanation: If the missing values are imputed with the mean of the non-missing values, all instances will have the same value for the feature. The feature then contributes the same term (zero) to every distance calculation, so the distances, and hence the nearest neighbors, are unaffected by removing the feature. This does not hold if the non-missing values are not all identical, which is the case for alternative D, since the imputed feature then contributes different terms to different distances.
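
A minimal sketch of alternatives C and D, using made-up data in which the second feature is partly missing (`None`). The helper names `impute_mean`, `pairwise` and `same_distances` are assumptions for this illustration only.

```python
import math

def impute_mean(X):
    """Replace missing (None) values of the second feature with the mean of the observed ones."""
    observed = [x[1] for x in X if x[1] is not None]
    mean = sum(observed) / len(observed)
    return [(x[0], mean if x[1] is None else x[1]) for x in X]

def pairwise(X):
    """All pairwise Euclidean distances, in a fixed order."""
    return [math.dist(a, b) for a in X for b in X]

def same_distances(X):
    """True if mean imputation and feature removal give the same pairwise distances."""
    removed = [(x[0],) for x in X]
    return all(math.isclose(p, q) for p, q in zip(pairwise(impute_mean(X)), pairwise(removed)))

identical = [(1.0, 5.0), (2.0, None), (4.0, 5.0), (7.0, None)]  # alternative C
varying = [(1.0, 3.0), (2.0, None), (4.0, 7.0), (7.0, None)]    # alternative D

print(same_distances(identical), same_distances(varying))
```

With identical non-missing values the distances agree, while with varying non-missing values they do not.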

## Question 5

In [None]:
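import numpy as np

# Which of the following will be useful for finding the nearest neighbors
# to a test object x_test in a set of instances X_train?
# (distance is assumed to be a function that returns the distance
# between two instances.)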

# A
A = sorted([distance(x_test,x_train) for x_train in X_train])

# B
B = sorted([(i, distance(x_test,x_train)) for i, x_train in enumerate(X_train)])

# C
C = sorted([(distance(x_test,x_train),i) for i, x_train in enumerate(X_train)])

# D
D = np.argsort([distance(x_test,x_train) for x_train in X_train])

In [None]:
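# Correct answers: C and D
# A discards the indices, so the neighbors cannot be identified, and
# B sorts the tuples by index rather than by distance.
# C sorts by distance while keeping the indices, and D returns the
# indices ordered by increasing distance.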
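
As a rough check, alternatives C and D can be run on a small made-up dataset. The Euclidean `distance` function below is only a stand-in, since the question does not specify one, and `X_train`, `x_test` and `k` are hypothetical values.

```python
import math
import numpy as np

def distance(a, b):
    """Euclidean distance; a stand-in for the unspecified distance function."""
    return math.dist(a, b)

X_train = [(5.0, 5.0), (1.0, 1.0), (3.0, 3.0), (0.0, 1.0)]
x_test = (0.0, 0.0)
k = 2

# C: (distance, index) pairs are sorted by distance first, so the indices can be read off.
C = sorted([(distance(x_test, x_train), i) for i, x_train in enumerate(X_train)])
neighbors_C = [i for _, i in C[:k]]

# D: argsort returns the indices ordered by increasing distance.
D = np.argsort([distance(x_test, x_train) for x_train in X_train])
neighbors_D = D[:k].tolist()

print(neighbors_C, neighbors_D)
```

Both alternatives identify instances 3 and 1 as the two nearest neighbors of `x_test`.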

## Question 6

In [34]:
import numpy as np
import pandas as pd

df = pd.DataFrame({"CLASS":["a","b","c","c","b"], 
                   "F1":[1,1,2,2,3],
                   "F2":[2,2,4,4,2]})

In [43]:
# Which of the following will generate a dictionary with a mapping from the
# class labels to the relative frequencies?

# A
a = {}
for c in df["CLASS"].astype("category").cat.categories:
    a[c] = sum(df["CLASS"]==c)/len(df)

# B
b = {}
for c, g in df.groupby("CLASS"):
    b[c] = len(g)/len(df)

# C
c = {c:sum(df["CLASS"]==c)/len(df) for c in pd.unique(df["CLASS"])}

# D
d = df["CLASS"].value_counts(normalize=True)

In [40]:
# Correct answers: A, B and C. The last alternative (D) does not generate
# a dictionary, but this can be fixed, e.g., by adding ".to_dict()" at the end.

{'b': 0.4, 'c': 0.4, 'a': 0.2}
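
The fix for alternative D can be sketched as follows, rebuilding only the `CLASS` column of the data frame used above. The name `d_dict` is introduced here for illustration.

```python
import math
import pandas as pd

df = pd.DataFrame({"CLASS": ["a", "b", "c", "c", "b"]})

# value_counts returns a pandas Series indexed by class label, not a dictionary.
d = df["CLASS"].value_counts(normalize=True)

# to_dict converts the Series into a plain dictionary from label to relative frequency.
d_dict = d.to_dict()

print(type(d).__name__, type(d_dict).__name__)
print(d_dict)
```

After the conversion, the result holds the same relative frequencies as alternatives A, B and C.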