# TP SVM classification: active learning
Diane Lingrand (diane.lingrand@univ-cotedazur)

Polytech - SI4 

Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

In [4]:
#necessary imports
import time
import matplotlib.pyplot as plt
import numpy as np
from sklearn import svm
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, f1_score, accuracy_score

## MNIST dataset

In [2]:
# loading the dataset
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, cache=True)

# extracting the data and the labels
X, y = mnist.data.to_numpy(), mnist.target.to_numpy()



**Question 1** 
- What is the dimension of the data space?
- How many samples are in the training dataset?

Compute these values (even if they are available on the net). Print the results in the form (10 and 100 are examples, not the correct values):

    Data are of dimension: 10.
    There are 100 data in the train dataset.

In [5]:
# convert str to int
y = y.astype(np.uint8)

x_train, x_test = X[:60000], X[60000:]
y_train, y_test = y[:60000], y[60000:]

#your answer
print("Data are of dimension: " + str(x_train.shape[1]) + ".")
print("There are " + str(len(x_train)) + " data in the train dataset.")


Data are of dimension: 784.
There are 60000 data in the train dataset.


**Question 1b**: If needed, reshape the data

In [None]:
#your answer, if needed

In [6]:
# you will now consider only 2 classes: the 3's and the 7's
c1 = 3
c2 = 7

**Question 2:**

Set Xtrain and Xtest to contain the part of the data from the original dataset that contains only data with labels 3 or 7. Set yTrain and yTest to the corresponding labels: 0 value for class '3' and 1 value for class '7'.
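
The 0/1 relabelling requested above can be obtained with a boolean comparison after filtering. The following is a minimal sketch on a toy label array; the names `labels`, `keep`, `kept_labels` and `binary_labels` are illustrative and not part of the notebook.

```python
import numpy as np

c1, c2 = 3, 7

# toy labels standing in for y_train (assumption: integer dtype)
labels = np.array([3, 1, 7, 7, 0, 3, 9], dtype=np.uint8)

# keep only the samples labelled c1 or c2
keep = (labels == c1) | (labels == c2)
kept_labels = labels[keep]

# map c1 -> 0 and c2 -> 1
binary_labels = (kept_labels == c2).astype(np.uint8)
print(binary_labels)
```

The same boolean mask `keep` can index the data array, so that data and labels stay aligned.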

In [7]:
#your answer
train_filter = (y_train == c1) | (y_train == c2)
x_train = x_train[train_filter]
y_train = y_train[train_filter]

test_filter = (y_test == c1) | (y_test == c2)
x_test = x_test[test_filter]
y_test = y_test[test_filter]

**Question 3:**

How many samples are there in class '3' and in class '7'? Print the values this way:
    
    Train: There are ... data in class 3 and ... data in class 7.
    Test: There are ... data in class 3 and ... data in class 7.

In [8]:
#your answer
three_train_filter = (y_train == c1)
seven_train_filter = (y_train == c2)
three_test_filter = (y_test == c1)
seven_test_filter = (y_test == c2)

print(f"Train: There are {three_train_filter.sum()} data in class 3 and {seven_train_filter.sum()} data in class 7.")
print(f"Test: There are {three_test_filter.sum()} data in class 3 and {seven_test_filter.sum()} data in class 7.")

Train: There are 6131 data in class 3 and 6265 data in class 7.
Test: There are 1010 data in class 3 and 1028 data in class 7.


## Baseline: train a linear SVM on the whole train dataset

**Question 4:**

Using a linear kernel and the default value C = 1, train an SVM classifier to separate the 3's from the 7's on the whole training dataset.


In [9]:
#your answer
svm_model = svm.SVC(kernel='linear', C=1)
svm_model.fit(x_train, y_train)

**Question 5:**

Compute the different metrics (F1 score, accuracy and confusion matrix) on the test dataset.

In [None]:
#your answer
y_pred = svm_model.predict(x_test)

accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, pos_label=c1)  # c1 is the positive class, c2 the negative class
conf = confusion_matrix(y_test, y_pred)

print("Accuracy: " + str(accuracy))
print("F1 score: " + str(f1))
print("Confusion matrix: \n" + str(conf))

Accuracy: 0.9764474975466143
F1 score: 0.9764936336924583
Confusion matrix: 
[[997  13]
 [ 35 993]]


## Active learning with SVM

Start with a few annotated samples, then iterate: request new labelled samples and retrain the SVM. Try different strategies for selecting the new labelled samples.

In [None]:
# short reminder for random integers:
import random
a = random.randint(2, 15)
# a is a random integer such that 2 <= a <= 15

In [25]:
# To avoid modifying (x_train, y_train), we work on a copy in the next cells:
xTrainP = np.copy(x_train)
yTrainP = np.copy(y_train)

**Question 6: Initialisation of the active training dataset**

Construct a new training dataset named (xActif, yActif). To initialise it, randomly draw nb0 samples from the copy of the original training dataset (xTrainP, yTrainP). You may use the labels in yTrainP to draw nb0/2 samples from each class. These nb0 samples must also be removed from (xTrainP, yTrainP), which can be done using [np.delete](https://numpy.org/doc/stable/reference/generated/numpy.delete.html).
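
One possible way to organise this initialisation is sketched below on a toy pool rather than on MNIST. The helper `init_active_set` and the toy arrays are hypothetical names introduced for illustration, and the sketch assumes exactly two classes and an even nb0.

```python
import numpy as np

def init_active_set(x_pool, y_pool, nb0, rng):
    """Draw nb0 // 2 samples per class from the pool and remove them from it."""
    picked = []
    for label in np.unique(y_pool):
        candidates = np.flatnonzero(y_pool == label)
        picked.extend(rng.choice(candidates, size=nb0 // 2, replace=False))
    picked = np.array(picked)
    x_active, y_active = x_pool[picked], y_pool[picked]
    # np.delete returns new arrays; the inputs are left untouched
    x_pool = np.delete(x_pool, picked, axis=0)
    y_pool = np.delete(y_pool, picked)
    return x_active, y_active, x_pool, y_pool

# toy pool: 10 samples of dimension 2, two classes (3 and 7)
rng = np.random.default_rng(0)
x_pool = rng.normal(size=(10, 2))
y_pool = np.array([3, 7] * 5)

x_active, y_active, x_rest, y_rest = init_active_set(x_pool, y_pool, nb0=4, rng=rng)
print(x_active.shape, x_rest.shape)
```

Passing `axis=0` to `np.delete` matters for the data array: without it, the array is flattened before the deletion.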

In [None]:
# we assume that nb0 is an even number
nb0 = 4 # number of data in the active training dataset at initialisation
xActif = []
yActif = []

In [None]:
#your answer

**Question 7: Iterations of the active learning** 

1. Learn a linear SVM classifier on the active training dataset
2. Compute the accuracy on the test dataset (not modified)
3. Randomly add nb new samples to the active training dataset and remove them from (xTrainP, yTrainP)
4. Go back to step 1 (20 times)
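
The four steps above can be sketched as a single loop. This is a minimal sketch on synthetic two-class data rather than MNIST; the function `random_active_learning`, the helper `blobs`, the value `nb=2` and the toy arrays are all assumptions made for illustration.

```python
import numpy as np
from sklearn import svm
from sklearn.metrics import accuracy_score

def random_active_learning(x_act, y_act, x_pool, y_pool, x_test, y_test,
                           nb=2, n_iter=20, seed=0):
    """Return the test accuracy measured at each iteration (random selection)."""
    rng = np.random.default_rng(seed)
    accuracies = []
    for _ in range(n_iter):
        # step 1: train a linear SVM on the current active set
        clf = svm.SVC(kernel="linear", C=1).fit(x_act, y_act)
        # step 2: evaluate on the untouched test set
        accuracies.append(accuracy_score(y_test, clf.predict(x_test)))
        # step 3: move nb random samples from the pool to the active set
        idx = rng.choice(len(x_pool), size=nb, replace=False)
        x_act = np.vstack([x_act, x_pool[idx]])
        y_act = np.concatenate([y_act, y_pool[idx]])
        x_pool = np.delete(x_pool, idx, axis=0)
        y_pool = np.delete(y_pool, idx)
    return accuracies

# synthetic stand-in for the 3-versus-7 data: two Gaussian blobs
rng = np.random.default_rng(0)

def blobs(n):
    x = np.vstack([rng.normal(-2, 1, size=(n, 2)), rng.normal(2, 1, size=(n, 2))])
    y = np.array([0] * n + [1] * n)
    return x, y

x_all, y_all = blobs(100)
x_test, y_test = blobs(50)
init = np.array([0, 1, 100, 101])  # two samples per class
accuracies = random_active_learning(
    x_all[init], y_all[init],
    np.delete(x_all, init, axis=0), np.delete(y_all, init),
    x_test, y_test,
)
print(len(accuracies))
```

The initial active set must contain both classes, otherwise `SVC.fit` raises an error.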

In [None]:
#your answer   

**Question 8: plot the evolution of the accuracy**

Plot the accuracy with respect to the iterations from the previous question.

**Question 9: strategy for choosing new data**
    
Repeat question 7 but, instead of choosing the new points randomly, choose at each iteration the nb points that are closest to the separating boundary. The [decision_function](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC.decision_function) method from scikit-learn will help you.
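
For a binary SVC, `decision_function` returns one signed score per sample, and a small absolute value means the sample lies close to the boundary. A minimal sketch of this selection on a toy 1-D problem follows; the helper `closest_to_boundary` and the toy arrays are hypothetical names introduced for illustration.

```python
import numpy as np
from sklearn import svm

def closest_to_boundary(clf, x_pool, nb):
    """Indices of the nb pool samples with the smallest |decision_function|."""
    scores = np.abs(clf.decision_function(x_pool))
    return np.argsort(scores)[:nb]

# toy 1-D problem: two labelled points, symmetric around 0
x_lab = np.array([[-2.0], [2.0]])
y_lab = np.array([0, 1])
clf = svm.SVC(kernel="linear", C=1).fit(x_lab, y_lab)

# unlabelled pool from which new samples are requested
x_pool = np.array([[-3.0], [0.1], [1.5], [-0.2], [4.0]])
idx = closest_to_boundary(clf, x_pool, nb=2)
print(idx, x_pool[idx].ravel())
```

In the loop of question 7, these indices would replace the randomly drawn ones.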

In [None]:
#your answer

**Question 10: plot the evolution of the accuracy**

Plot the accuracy with respect to the iterations from the previous question.
Compare with question 8. Also compare with the baseline.
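
A possible layout for this comparison is sketched below: one curve per strategy and a horizontal line for the baseline. The function `plot_accuracy_curves` is a hypothetical helper, and the numbers passed to it are dummy values for illustration only, not measured results.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt

def plot_accuracy_curves(curves, baseline=None):
    """Plot one accuracy curve per strategy; curves maps a label to a list of accuracies."""
    fig, ax = plt.subplots()
    for label, values in curves.items():
        ax.plot(range(len(values)), values, label=label)
    if baseline is not None:
        # accuracy of the SVM trained on the whole training dataset
        ax.axhline(baseline, color="gray", linestyle="--", label="baseline")
    ax.set_xlabel("iteration")
    ax.set_ylabel("accuracy on the test dataset")
    ax.legend()
    return fig, ax

# dummy values for illustration only, NOT measured results
fig, ax = plot_accuracy_curves(
    {"random selection": [0.5, 0.6, 0.7], "closest to the separation": [0.5, 0.7, 0.8]},
    baseline=0.9,
)
```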

In [None]:
#your answer

**Question 11: many random starts**
    
Since the initialisation is random, running the previous code several times can lead to different curves for questions 8 and 10. Write here the code needed to plot several (e.g. 10) curves for each of questions 8 and 10, and display these new plots. Which strategy is the best?
    

In [None]:
#your answer

**Question 12: hyperparameters**

So far, you have used the linear kernel with the default parameters. Using the strategy of question 9, how could you choose the kernel and the hyperparameters? Try different experiments such as:
- choose the kernel and hyperparameters using the nb0 initial samples
- update the kernel and hyperparameters after a few iterations
- compare different trials
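
One option among others is a cross-validated grid search on the samples labelled so far. The sketch below uses scikit-learn's `GridSearchCV` on synthetic data; the grid values, `cv=3` and the toy arrays are assumptions chosen for illustration. Cross-validation needs several samples per class, so it is unreliable or impossible with only nb0 = 4 samples.

```python
import numpy as np
from sklearn import svm
from sklearn.model_selection import GridSearchCV

# synthetic stand-in for the currently labelled samples (20 per class)
rng = np.random.default_rng(0)
x_lab = np.vstack([rng.normal(-2, 1, size=(20, 2)), rng.normal(2, 1, size=(20, 2))])
y_lab = np.array([0] * 20 + [1] * 20)

# candidate kernels and hyperparameters (illustrative values)
param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]},
]

search = GridSearchCV(svm.SVC(), param_grid, cv=3, scoring="accuracy")
search.fit(x_lab, y_lab)
print(search.best_params_, search.best_score_)
```

Re-running the search after a few active learning iterations corresponds to the second experiment in the list above.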
    