# Problem 2-1

Consider a data set $X$ with $|X|=2M$ objects and *uniform random* labels (independent of the attribute values).
Assume that there are no duplicates in $X$.
The classes are disjoint: every object is labeled either $C_1$ or $C_2$, and $|C_1|=|C_2|=M$.

Because the labels are random, the `majority` classifier that always
predicts the *most frequent label* is the theoretically optimal classifier
(as the attributes are not informative, we do not need to use them).
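
To make the `majority` classifier concrete, here is a minimal sketch. The class name `MajorityClassifier` and the tie-breaking rule (the first label in sorted order wins) are assumptions made for illustration; the problem statement does not fix either.

```python
import numpy as np

class MajorityClassifier:
    """Hypothetical helper: always predicts the most frequent training label."""

    def fit(self, X, y):
        # Count how often each label occurs; the attributes in X are ignored.
        labels, counts = np.unique(y, return_counts=True)
        # np.argmax returns the first maximum, so ties go to the
        # smallest label in sorted order (an assumption of this sketch).
        self.label_ = labels[np.argmax(counts)]
        return self

    def predict(self, X):
        # One identical prediction per row of X.
        return np.full(len(X), self.label_)

X = np.zeros((5, 2))
y = np.array(["C1", "C2", "C2", "C1", "C2"])
pred = MajorityClassifier().fit(X, y).predict(X)
print(pred)
```

Which label is the majority depends only on the training labels, so the prediction can change whenever the training set changes.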

Hint: do *not* assume the classifier is *always* trained with $|C_{1,\text{training}}|=|C_{2,\text{training}}|$,
but make sure to first identify the training set used in this subproblem.

a) What is the exact error rate of the `majority` classifier on the entire data set?

### TODO

b) If we do leave-one-out validation, what is the estimated error rate using the `majority` classifier?
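
One way to check a derivation for this subproblem is to simulate it. The sketch below is only an illustration: the helper names `majority_label` and `loo_error_majority` and the choice $M=5$ are assumptions, not part of the assignment. Each object is held out in turn, the majority label is computed on the remaining $2M-1$ labels, and the prediction is compared with the held-out label.

```python
import numpy as np

def majority_label(y):
    # Most frequent label; ties go to the smallest label (assumption).
    labels, counts = np.unique(y, return_counts=True)
    return labels[np.argmax(counts)]

def loo_error_majority(y):
    # Leave-one-out error of the majority classifier (attributes are unused).
    y = np.asarray(y)
    errors = 0
    for i in range(len(y)):
        train = np.delete(y, i)            # all labels except the i-th
        errors += majority_label(train) != y[i]
    return errors / len(y)

M = 5                                      # illustrative size only
y = np.array([1] * M + [2] * M)            # |C_1| = |C_2| = M
print(loo_error_majority(y))
```

Comparing the printed value with the answer to a) is a useful starting point for the interpretation asked for in d).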

### TODO

c) If we do .632 bootstrap, what appears to be the best classifier in training, and what is its estimated error rate?
(Note: this is the difficult part of the assignment. Use just one iteration, $k=1$, for simplicity.)

### TODO

d) Interpret these results with respect to the *evaluation procedure*.

### TODO

# Problem 2-2

In Moodle, you can find the data set `sonar.all-data`, which contains sonar measurements from a mine-sweeper ship. Each line of the file contains 60 comma-separated sensor measurements, followed by the class label in the last column (`M`: mine, `R`: rock).

You are encouraged to use `numpy` functionality where appropriate, but you may not use `sklearn`
functionality (such as the existing evaluation functions) except for the classifiers given in the provided code.

In [None]:
import pandas as pd

df = pd.read_csv('sonar.all-data', header=None)

# Return a (sampled) fraction of all samples. Since frac=1, this is the whole data-set but shuffled
# In addition we reset the index (line numbers)
df = df.sample(frac=1).reset_index(drop=True) 
df

a) Randomly split the data set into two disjoint subsets (`XTrain`, `YTrain`) and
(`XTest`, `YTest`), where the training data contains $80 \%$ of the original data.
Print the shape of each array.
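
A random disjoint split can be sketched with a permutation of the row indices. The example below runs on small synthetic stand-in arrays (`data`, `labels`) rather than on the sonar data, and the `_demo` names are hypothetical; it only illustrates the indexing pattern.

```python
import numpy as np

np.random.seed(0)
data = np.random.normal(size=(10, 4))            # stand-in for the sensor columns
labels = np.random.choice(["M", "R"], size=10)   # stand-in for the class labels

# Shuffle the row indices once, then cut at 80 %.
perm = np.random.permutation(len(data))
n_train = int(0.8 * len(data))
train_idx, test_idx = perm[:n_train], perm[n_train:]

# Index data and labels with the same indices so that rows stay aligned.
XTrain_demo, YTrain_demo = data[train_idx], labels[train_idx]
XTest_demo, YTest_demo = data[test_idx], labels[test_idx]
print(XTrain_demo.shape, YTrain_demo.shape, XTest_demo.shape, YTest_demo.shape)
```

Because both subsets come from one permutation, no row can end up in both.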

In [None]:
# TODO

b) Implement a function `accuracy(classifier,X1,y1,X2,y2)` that uses the function `classifier.fit` to train on $(X_1,y_1)$, then `classifier.predict` to predict labels for $X_2$, and then returns the accuracy of this prediction.

Assume that the classifier uses the following API:

`classifier.fit(X,y)`: Trains the classifier on training matrix $X \in \mathbb R^{n\times d}$ with label vector $y \in \mathbb R^n$. Returns the trained model.

`classifier.predict(X)`: Applies the previously trained classifier on test data $X \in \mathbb R^{n\times d}$. Returns a vector of predictions $\widehat y \in \mathbb R^{n}$.
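
Accuracy is the fraction of predictions that equal the true labels. A minimal sketch of that computation on hand-made label vectors (the arrays `y_true` and `y_pred` are made up for illustration):

```python
import numpy as np

y_true = np.array(["M", "R", "R", "M"])
y_pred = np.array(["M", "R", "M", "M"])

# Element-wise comparison gives booleans; their mean is the hit rate.
acc = np.mean(y_true == y_pred)
print(acc)  # 0.75 (three of four predictions match)
```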

In [None]:
def accuracy(classifier,X1,y1,X2,y2):
    # TODO
    raise NotImplementedError

c) Implement a function `holdout_evaluation(X,y,k,p,measure,classifier)`
that selects rows with probability $p$ into the training data $(X_1,y_1)$,
and uses the remaining data as validation data $(X_2,y_2)$.

Make sure that the labels correspond to the data rows.
The function then calls the function `measure` for the given `classifier`, and the generated training and validation data.
Return the average and standard deviation of $k$ repetitions of this procedure.
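
Selecting each row independently with probability $p$ can be sketched with a boolean mask. This is an illustration on toy arrays, not the required function. It uses the legacy `numpy.random` functions because the benchmark code in e) seeds them via `numpy.random.seed`; the values of `X`, `y`, and `p` are assumptions.

```python
import numpy as np

np.random.seed(0)
X = np.arange(20).reshape(10, 2)   # toy data: row i is [2i, 2i+1]
y = np.arange(10)                  # toy labels: label i belongs to row i
p = 0.5

# One uniform draw per row; True means "goes into the training data".
mask = np.random.rand(len(X)) < p
X1, y1 = X[mask], y[mask]          # training part
X2, y2 = X[~mask], y[~mask]        # validation part
print(len(X1), len(X2))
```

Applying the same mask to `X` and `y` keeps the labels aligned with their rows. The training size is random under this scheme, so it varies between repetitions.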

In [None]:
def holdout_evaluation(X,y,k,p,measure,classifier):
    # TODO
    raise NotImplementedError

d) Implement a function `cross_validation(X,y,k,measure,classifier)` that *randomly* partitions the data into $k$ disjoint partitions, whose sizes differ by at most 1.

Return the average score (and standard deviation) measured by `measure` using the given `classifier`,
when using each partition *once* for testing, and all *other* partitions for training (i.e., $k$ runs total).
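
A random partition with sizes differing by at most 1 can be sketched by splitting a shuffled index array with `np.array_split`, which accepts splits that do not divide evenly. The sizes `n = 10` and `k = 3` below are illustrative assumptions.

```python
import numpy as np

np.random.seed(0)
n, k = 10, 3

# Shuffle the indices 0..n-1 and cut them into k nearly equal folds.
folds = np.array_split(np.random.permutation(n), k)

for i, test_idx in enumerate(folds):
    # Train on every fold except the i-th.
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    print(i, len(train_idx), len(test_idx))
```

Each index appears in exactly one test fold, so every row is used for testing once across the $k$ runs.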

In [None]:
def cross_validation(X,y,k,measure,classifier):
    # TODO
    raise NotImplementedError

e) Run the provided code to compare different classifiers. 
Discuss the result: which classifier(s) would you pick based on each of the validation procedures,
and what accuracy do you predict for future data?

In [None]:
# Define various classifiers that we want to benchmark.
import numpy
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

classifiers = [
    ("KNN k=1", KNeighborsClassifier(1)), ("KNN k=2", KNeighborsClassifier(2)),
    ("KNN k=5", KNeighborsClassifier(5)), ("KNN k=10", KNeighborsClassifier(10)),
    ("DecTree maxDepth=2", DecisionTreeClassifier(max_depth=2)),
    ("DecTree maxDepth=4", DecisionTreeClassifier(max_depth=4)),
    ("DecTree maxDepth=6", DecisionTreeClassifier(max_depth=6))
]

# Benchmark classifiers.
print("Classifier\t\tHoldout Eval.\tCross-Val.\tTrain\tTest")
for name, classifier in classifiers:
    numpy.random.seed(0)
    ho, hos = holdout_evaluation(XTrain, YTrain, 10, 1./3, accuracy, classifier)
    cv, cvs = cross_validation(XTrain, YTrain, 10, accuracy, classifier)
    tr = accuracy(classifier, XTrain, YTrain, XTrain, YTrain)
    te = accuracy(classifier, XTrain, YTrain, XTest, YTest)
    print("%-20s\t%.3f ± %.3f\t%.3f ± %.3f\t%.3f\t%.3f" % (name, ho, hos, cv, cvs, tr, te))

### TODO