# Estimation 1
## Practical Exercises

### Exercise 5
In this experiment, we look at how the number of functions tested affects the selection of the best function using a validation set. We use the digits dataset with a Gaussian SVM and test different values of the kernel width parameter $\gamma$ (in scikit-learn the kernel is $\exp(-\gamma \|x - x'\|^2)$, so a larger $\gamma$ gives a narrower kernel). Sets 0, 1, and 2 contain 4, 8, and 12 values of $\gamma$ respectively.

Does increasing the number of functions tested increase the probability of selecting a suboptimal choice? Are the results in the experiment consistent with what theory suggests?
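
One way to make "what theory suggests" concrete is Hoeffding's inequality combined with a union bound over the $k$ candidate functions: with probability at least $1 - \delta$, the validation accuracy of every candidate is within $\sqrt{\ln(2k/\delta)/(2n)}$ of its true accuracy, where $n$ is the validation set size. The sketch below simply evaluates that bound. The function name `validation_gap_bound`, the choice $\delta = 0.05$, and $n = 378$ (roughly 30% of the training split used in this exercise) are illustrative assumptions, not part of the exercise code.

```python
import math


def validation_gap_bound(k, n, delta=0.05):
    """Hoeffding + union bound on |validation accuracy - true accuracy|.

    Holds simultaneously for all k candidates with probability >= 1 - delta,
    assuming the n validation examples are drawn i.i.d.
    """
    return math.sqrt(math.log(2 * k / delta) / (2 * n))


n_val = 378  # assumed validation set size (illustrative)
for k in (4, 8, 12):
    bound = validation_gap_bound(k, n_val)
    print(f"k = {k:2d} candidates: gap bound = {bound:.3f}")
```

The bound grows only logarithmically in $k$, so tripling the number of candidates from 4 to 12 widens it modestly. Comparing this with the spread of validation and test accuracies printed by the experiment is one way to judge whether the results agree with the theory.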

In [None]:
# Modified from http://scikit-learn.org/stable/auto_examples/model_selection/grid_search_digits.html

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
from sklearn.model_selection import ShuffleSplit
import matplotlib.pyplot as plt

digits = datasets.load_digits()

# Show the images
images_and_labels = list(zip(digits.images, digits.target))
for index, (image, label) in enumerate(images_and_labels[:4]):
    plt.subplot(1, 4, index + 1)
    plt.axis('off')
    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    plt.title('Training: %i' % label)
plt.show()

# To apply a classifier to this data, we need to flatten each image,
# turning the data into a (samples, features) matrix:
n_samples = len(digits.images)
X = digits.images.reshape((n_samples, -1))
y = digits.target

# Split the dataset: 70% for training/validation, 30% held out as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

# Use a single random split of the training data,
# with 30% of it held out as the validation set.
cv = ShuffleSplit(n_splits=1, test_size=0.3, random_state=42)

# Candidate gamma values: sets 0, 1, and 2 have 4, 8, and 12 choices
param_sets = [[0.01, 0.001, 0.0001, 0.00001],
              [0.01, 0.005, 0.001, 0.0005, 0.0001, 0.00005, 0.00001, 0.000005],
              [0.01, 0.006, 0.003, 0.001, 0.0006, 0.0003, 0.0001, 0.00006, 0.00003, 0.00001, 0.000006, 0.000003]]
for i in range(3):
    # Candidate parameters; the best one is selected using the validation set
    tuned_parameters = [{'kernel': ['rbf'],
                         'gamma': param_sets[i]}]

    # Do grid search, scoring each gamma on the single validation split
    clf = GridSearchCV(SVC(C=10), tuned_parameters, cv=cv)
    clf.fit(X_train, y_train)

    print("Set " + repr(i) + ": Choose from " + repr(len(param_sets[i])) + " choices.")
    print("Best parameters found on development set:")
    print(clf.best_params_)
    means = clf.cv_results_['mean_test_score']
    print("Validation set accuracies:")
    print(means)
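    # By default (refit=True), GridSearchCV refits the best model on all of
    # X_train, and predict() uses that refitted model.
    y_pred = clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print("Test set accuracy: " + "{0:.2f}".format(accuracy))
    print("================================================")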

### Exercise 8
In this experiment, we will estimate the Rademacher complexity of linear SVMs, Gaussian SVMs, and decision trees. We will use 1000 randomly generated 20-dimensional binary vectors as the input set. For each of 100 trials, we draw random binary labels, fit each classifier, and measure its training accuracy. The parameter C is set to 1 for both the linear and Gaussian SVMs, and the parameter gamma is set to 1 for the Gaussian SVM.

Before running your experiment, predict roughly what the estimated Rademacher complexities of the three classifier classes would be.
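
The estimate in this exercise rests on an identity linking accuracy on random labels to correlation with random signs. For labels $\sigma_i \in \{-1, +1\}$ and predictions $h(x_i) \in \{-1, +1\}$, the product $\sigma_i h(x_i)$ is $+1$ for a correct prediction and $-1$ otherwise, so $\frac{1}{n}\sum_i \sigma_i h(x_i) = 2 \cdot \text{accuracy} - 1$. The sketch below checks this numerically on made-up predictions; the helper names and the 80% agreement rate are illustrative assumptions and are not taken from the exercise.

```python
import numpy as np


def correlation(sigma, pred):
    """Average of sigma_i * h(x_i) for +/-1 labels and predictions."""
    return float(np.mean(sigma * pred))


def accuracy(sigma, pred):
    """Fraction of predictions that match the labels."""
    return float(np.mean(sigma == pred))


rng = np.random.default_rng(0)
n = 1000

# Random +/-1 labels (Rademacher variables)
sigma = rng.choice([-1, 1], size=n)

# Hypothetical predictions that disagree with the labels about 20% of the time
flip = rng.random(n) < 0.2
pred = np.where(flip, -sigma, sigma)

corr = correlation(sigma, pred)
acc = accuracy(sigma, pred)
print(f"correlation = {corr:.3f}, 2*accuracy - 1 = {2 * acc - 1:.3f}")
```

Because a trained classifier is only one member of its class, the correlation it achieves on a given draw of labels can be no larger than the supremum over the class, so averaging over trials gives an estimate that may understate the true empirical Rademacher complexity.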

In [None]:
import numpy as np
from sklearn import svm
from sklearn import tree
from sklearn.metrics import accuracy_score

train_size = 1000
input_size = 20
num_samples = 100

np.random.seed(0)

# Construct a random training set of binary vectors
train_data = np.random.randint(2, size=(train_size, input_size))

total_lsvm = 0
total_gsvm = 0
total_dt = 0
for i in range(num_samples):
    # Draw fresh random binary labels for this trial
    train_label = np.random.randint(2, size=train_size)
    clf = svm.SVC(kernel='linear', C=1)
    clf.fit(train_data, train_label)
    predict = clf.predict(train_data)
    accuracy = accuracy_score(train_label, predict)
    total_lsvm = total_lsvm + accuracy

    clf = svm.SVC(kernel='rbf', C=1, gamma=1)
    clf.fit(train_data, train_label)
    predict = clf.predict(train_data)
    accuracy = accuracy_score(train_label, predict)
    total_gsvm = total_gsvm + accuracy
    
    clf = tree.DecisionTreeClassifier()
    clf.fit(train_data, train_label)
    predict = clf.predict(train_data)
    accuracy = accuracy_score(train_label, predict)
    total_dt = total_dt + accuracy

# Compute average accuracy, then Rademacher complexity as 2 * accuracy - 1
acc_lsvm = total_lsvm/num_samples
rc_lsvm = 2*acc_lsvm-1
acc_gsvm = total_gsvm/num_samples
rc_gsvm = 2*acc_gsvm-1
acc_dt = total_dt/num_samples
rc_dt = 2*acc_dt-1

print("Linear SVM estimated Rademacher complexity: " + "{0:.2f}".format(rc_lsvm)) 
print("Linear SVM estimated accuracy on random labels: " + "{0:.2f}".format(acc_lsvm))
print("Gaussian SVM estimated Rademacher complexity: " + "{0:.2f}".format(rc_gsvm))
print("Gaussian SVM estimated accuracy on random labels: " + "{0:.2f}".format(acc_gsvm))
print("Decision tree estimated Rademacher complexity: " + "{0:.2f}".format(rc_dt))
print("Decision tree estimated accuracy on random labels: " + "{0:.2f}".format(acc_dt))