# Questions on Naïve Bayes and k-Nearest Neighbors (kNN)

## Question 1

Assume that we want to use a Naïve Bayes classifier on a binary
classification task with the class labels $c1$ and $c2$ and the
binary features $f1$ and $f2$. Moreover, assume a uniform class
prior, i.e., $P(c1) = P(c2)$, and that the class-conditional
probabilities include $P(f1=0|c1) = 0$ and $P(f2=0|c2) = 1$.

What class label $c \in \{c1,c2\}$ maximizes $P(c|f1=0 \& f2=1)$?

A: c1 will get a higher probability than c2

B: c2 will get a higher probability than c1

C: c1 and c2 will get the same probabilities
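
For reference, the way a Naïve Bayes classifier scores each class can be sketched as follows. This is a minimal illustration, not part of the question: the function name `naive_bayes_scores` and all probability values are hypothetical and deliberately differ from the ones given above.

```python
def naive_bayes_scores(priors, cond_probs, observation):
    """Return the unnormalized score P(c) * prod_i P(f_i = v_i | c) per class."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for feature, value in observation.items():
            # Naive assumption: features are independent given the class.
            score *= cond_probs[label][feature][value]
        scores[label] = score
    return scores


# Hypothetical probabilities (NOT the ones from the question).
priors = {"c1": 0.5, "c2": 0.5}
cond_probs = {
    "c1": {"f1": {0: 0.2, 1: 0.8}, "f2": {0: 0.6, 1: 0.4}},
    "c2": {"f1": {0: 0.7, 1: 0.3}, "f2": {0: 0.5, 1: 0.5}},
}

scores = naive_bayes_scores(priors, cond_probs, {"f1": 0, "f2": 1})
predicted = max(scores, key=scores.get)
```

The class with the largest score is the predicted label. Dividing each score by the sum of all scores gives the posterior probabilities, provided that sum is nonzero.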

## Question 2

Assume that we are facing a binary classification task in which a positive class label ($+$) is observed when the binary features $f_1$ and $f_2$ have the same value (both 0 or both 1), and a negative label ($-$) is observed in all other cases, i.e., when $f_1 \neq f_2$.
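
The labelling rule above can be enumerated as a small truth table. The sketch below assumes every combination of feature values occurs exactly once; the variable name `rows` is only illustrative.

```python
from itertools import product

# One row per combination of (f1, f2); the label is "+" when the
# two features agree and "-" otherwise.
rows = [(f1, f2, "+" if f1 == f2 else "-")
        for f1, f2 in product([0, 1], repeat=2)]

for f1, f2, label in rows:
    print(f"f1={f1} f2={f2} label={label}")
```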

Can Naïve Bayes be expected to learn an accurate model for this task?

A: Yes

B: No

C: Maybe

## Question 3

Assume that a large number of binary features are added to a dataset
with two class labels c1 and c2, such that for each added feature f,
the class-conditional probabilities satisfy $P(f=0|c1) = P(f=0|c2)$.
What potential effect will the addition of such features have on the
accuracy of Naïve Bayes and kNN, respectively?

A: This will not have any effect on any of the algorithms

B: This will have an effect on Naïve Bayes only

C: This will have an effect on k-Nearest Neighbors only

D: This will have an effect on both algorithms

## Question 4

Assume that one feature has a large number of missing values, while its non-missing values are all identical. Can the results of the k-nearest neighbor algorithm using the Euclidean distance be affected if we choose to remove the feature completely instead of imputing values?
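
For reference, the Euclidean distance between two instances $x$ and $y$ with $m$ features is usually defined as follows (the symbols $x$, $y$, $m$ and $j$ are introduced here only for illustration):

$$d(x, y) = \sqrt{\sum_{j=1}^{m} (x_j - y_j)^2}$$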

A: Yes, and this holds independently of how we choose to do the imputation

B: No, we will always get the same results

C: The results will be the same if we choose to impute missing values with the mean of the non-missing values

D: The previous statement (C) would hold even if the non-missing values were not identical

## Question 5

In [None]:
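import numpy as np  # needed for option D

# Which of the following will be useful for finding the nearest neighbors
# to a test object x_test in a set of instances X_train?
# (distance, x_test and X_train are assumed to be defined elsewhere.)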

# A
A = sorted([distance(x_test,x_train) for x_train in X_train])

# B
B = sorted([(i, distance(x_test,x_train)) for i, x_train in enumerate(X_train)])

# C
C = sorted([(distance(x_test,x_train),i) for i, x_train in enumerate(X_train)])

# D
D = np.argsort([distance(x_test,x_train) for x_train in X_train])
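
The options above rely on a `distance` function and on `x_test` and `X_train`, none of which are defined in the cell. One possible setup is sketched below, assuming Euclidean distance and a few made-up points; both the choice of metric and the data are assumptions made for illustration.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length sequences (assumed metric)."""
    return math.dist(a, b)


# Made-up training instances and test object.
X_train = [[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]]
x_test = [0.0, 1.0]

# Distance from the test object to every training instance, in training order.
distances = [distance(x_test, x_train) for x_train in X_train]
```

With these definitions in place, each of the four options can be run and compared.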

## Question 6

In [None]:
import numpy as np
import pandas as pd

df = pd.DataFrame({"CLASS":["a","b","c","c","b"], 
                   "F1":[1,1,2,2,3],
                   "F2":[2,2,4,4,2]})

In [None]:
# Which of the following will generate a dictionary with a mapping from the
# class labels to the relative frequencies?

# A
a = {}
for c in df["CLASS"].astype("category").cat.categories:
    a[c] = sum(df["CLASS"]==c)/len(df)

# B
b = {}
for c, g in df.groupby("CLASS"):
    b[c] = len(g)/len(df)

# C
c = {c:sum(df["CLASS"]==c)/len(df) for c in pd.unique(df["CLASS"])}

# D
d = df["CLASS"].value_counts(normalize=True)