# Lab 5: Model Selection, Cross Validation and Regularization

In  this session, we will use some already seen datasets to illustrate important techniques like validation and regularization. First, we come back to the student "exams" dataset to understand the need of testing data in addition to the training data.  

Then, we will illustrate the under-fitting and over-fitting phenomenas on randomly generated data. After that, we will select a model that fits the best our cross validation data.  

Finally, we will implement regularization technique on the "Microchip" testing dataset. Thus, we could understand how this technique helps to prevent over-fitting. 

### Train and Test data: Student "exams" dataset
In this section, we will train a logistic classifier on the training data of student "exams" dataset. Then, we will predict the student admission of the test data and compare the accuracy of the classifier on the training and test data.

<font color="blue">**Question 1: **</font>The *"exams_train_data.txt"* file contains 3 columns that represent the exam 1, exam 2 scores and the results of 100 students (0: Not admitted, 1: Admitted). 
- Load train data from "exams_train_data.txt" file in "students_results_train" variable and check its size. (use [loadtxt](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.loadtxt.html) function from numpy library)
- Implement the "Poly_Features" function that concatenates to data array the different possible powers (below deg) and interaction terms of feature vectors f1 and f2 as shown below:$$data=[data,~f_1,~ f_2,~ f_1^2,~ f_1\times f_2,~ f_2^2,~ \dots,~ f_1^{deg},~ f_1^{deg-1}\times f_2,~\dots,~ f_2^{deg}]$$



In [2]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import fmin_bfgs

#load training data
students_results_train = np.loadtxt("exams_train_data.txt",delimiter='\t')

# you could verify the size of the data using shape() function on numpy array house_data
print("The training data contains {0} student results. There are {1} columns for each exam score and 1 column for admission".format(students_results_train.shape[0],students_results_train.shape[1]-1))

def Poly_Features(data,f1,f2,deg):
    for i in range(1,deg+1):
        for j in range(0,i+1):
            f = (np.power(f1,(i-j))*np.power(f2,j))
            data = np.concatenate((data,f),axis=1)
    return data


m_train = students_results_train.shape[0] # number of student
x_1_train = students_results_train[:,0,np.newaxis] # we add np.newaxis in the indexing to obtain an array 
x_2_train = students_results_train[:,1,np.newaxis] # with shape (100,1) instead of (100,)
y_train = students_results_train[:,2,np.newaxis] # the student admission result vector

# add polynomial features to the array data X
degree=2  # degree of polynomial feature
X_train=np.ones((m_train,1))   # initialize X array
X_train = Poly_Features(X_train,x_1_train,x_2_train,degree)  
n = X_train.shape[1]  # number of features
print("The number of features is: ",n)

# define sigmoid function for logistic regression hypothesis 
def sigmoid(z):
    return np.ones(z.shape)/(1+np.exp(-z))

# define logistic cost function (inspired from max likelihood)
def cost_func(theta):
    J=np.sum(-y_train*np.log(sigmoid(np.dot(X_train,theta[:,np.newaxis]))))-np.sum((1-y_train)*np.log(1-sigmoid(np.dot(X_train,theta[:,np.newaxis]))))
    return J/m_train  

# define the gradient of logistic cost function 
def grad_cost_func(theta):
    g=(1/m_train)*(np.dot(X_train.transpose(),(sigmoid(np.dot(X_train,theta[:,np.newaxis]))-y_train)))  # this is the vectorized implementation
    g.shape=(g.shape[0],)
    return g  

# calculate the optimal theta
theta0=np.zeros((n,),dtype=float)
theta_opt= fmin_bfgs(cost_func,theta0,fprime=grad_cost_func,disp=0)

The training data contains 100 student results. There are 2 columns for each exam score and 1 column for admission
The number of features is:  6




<font color="blue">**Question 2: **</font>
- Predict "y_pred_train" the admission result of each student on train data.
- Calculate the training accuracy (number of good prediction/number of all student) on the training data.

In [3]:
%matplotlib notebook
from matplotlib.patches import Rectangle

# calculate predection and train accuracy
#y_pred = sigmoid(np.dot(X,theta_opt))>=0.5
y_pred_train = sigmoid(np.dot(X_train,theta_opt))>=0.5 
train_accuracy =  100*(np.sum(y_pred_train[:,np.newaxis]==y_train))/m_train 
print("The accuracy on the training data is:", train_accuracy,"%")

# calculate the mesh grid for contour plot
u1=np.linspace(5,20,100)
u2=np.linspace(5,20,100)
u1, u2 = np.meshgrid(u1, u2)

X3=np.ones((*u1.shape,1))
for i in range(1,degree+1):
    for j in range(i+1):
         X3 = np.concatenate((X3,u1[...,np.newaxis]**(i-j)*u2[...,np.newaxis]**j),axis=-1)

Z=np.dot(X3,theta_opt)

# plot descision boundries
plt.figure("Admission decision boundries",figsize=(9,5))
fail=plt.scatter(x_1_train[y_train==0], x_2_train[y_train==0],  color='red',label='fail')
success=plt.scatter(x_1_train[y_train==1], x_2_train[y_train==1],  color='green',marker='+',s=80,label='success')
plt.xlabel('Exam 1 score')
plt.ylabel('Exam 2 score')
plt.title('Adimitted/Not admitted Students')
ctr = plt.contour(u1, u2, Z,0,colors="blue")
extra = Rectangle((0, 0), 3, 4, fc="w", fill=False, edgecolor="b", linewidth=1)
plt.legend([extra,fail,success], ("decision boundries","fail","success"),loc='best')


The accuracy on the training data is: 98.0 %


<IPython.core.display.Javascript object>

<matplotlib.legend.Legend at 0x115e1cf28>

<font color="blue">**Question 3: **</font>
- Load test data from "exams_test_data.txt" file in "students_results_test" variable and check its size. (use [loadtxt](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.loadtxt.html) function from numpy library).
- Calculate the test accuracy (number of good prediction/number of all student) on the test data and compare it with the train accuracy. Interpret the difference.

In [4]:
students_results_test =  np.loadtxt("exams_test_data.txt",delimiter='\t')

m_test = students_results_test.shape[0] # number of student in test data
x_1_test = students_results_test[:,0,np.newaxis] # we add np.newaxis in the indexing to obtain an array 
x_2_test = students_results_test[:,1,np.newaxis] # with shape (100,1) instead of (100,)
y_test = students_results_test[:,2,np.newaxis] # we add np.newaxis in the indexing to obtain an array with shape (100,1) instead of (100,)

# add polynomial features to the array data X
X_test=np.ones((m_test,1))   # initialize X array
X_test = Poly_Features(X_test,x_1_test,x_2_test,degree)

# calculate prediction and accuracy on test data
y_test_pred = sigmoid(np.dot(X_test,theta_opt))>=0.5
test_accuracy = (np.sum(y_test_pred==y_train))/m_train
print("The accuracy on the test data is:", test_accuracy,"%")

# plot descision boundries and test data
plt.figure("decision boundries and test data",figsize=(9,5))
fail=plt.scatter(x_1_test[y_test==0], x_2_test[y_test==0],  color='red')
success=plt.scatter(x_1_test[y_test==1], x_2_test[y_test==1],  color='green',marker='+',s=80)
fail_train=plt.scatter(x_1_train[y_train==0], x_2_train[y_train==0],  color='gray')
success_train=plt.scatter(x_1_train[y_train==1], x_2_train[y_train==1],  color='gray',marker='+',s=80)
plt.xlabel('Exam 1 score')
plt.ylabel('Exam 2 score')
plt.title('Adimitted/Not admitted Students')
ctr = plt.contour(u1, u2, Z,0,colors="blue")
plt.legend([extra,fail,success,fail_train,success_train], ("decision boundries","test data (fail)","test data (success)","train data (fail)","train data (success)"),loc='best')


The accuracy on the test data is: 13.0 %


<IPython.core.display.Javascript object>

<matplotlib.legend.Legend at 0x115f0a240>

### Underfitting and Overfitting

In this part we will study the effect of the number of features and the model complexity on the training phase and we will illustrate the underfitting and overfitting phenomena. We will use randomly generated data for a regression problem and we will try to use several models with different number of features (different polynomial features degrees). Then, we will see how well the model fits the training and the test data.

<font color="blue">**Question 4: **</font>
- Split the data ("x" and "y" vector) to training and test data with size "m_train" and "m_test" respectively. When the original data is sorted, you should choose the training and the test sets randomly to ensure that the two sets cover the possible values space in a best manner.  
**Hint:** You could use [random permutation](https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.permutation.html) function to generate a random permutation of "m" indices for "x" and "y" vectors. Then, you could select the first "m_train" indices to index train data from "x" and "y". You could use the rest of indices to index the test data.
- Use [PolynomialFeatures](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html) class and [PolynomialFeatures.fit_transform](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html#sklearn.preprocessing.PolynomialFeatures.fit_transform) functions from numpy library to generate polynomial features of degree "i". This will generate automatically the features instead of adding them by hand as in the previous example with the implemented function "Poly_Features".

In [11]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# generate random data
np.random.seed(0)
m = 15
x = np.linspace(0,10,m) + np.random.randn(m)/5
y = np.sin(x)+x/6 + np.random.randn(m)/10


# calculate the size of training and test sets
train_ratio = 0.75
m_train = int(round(train_ratio*m)) 
m_test = m-m_train
print("the size of training set is:",m_train,"\nthe size of test set is:",m_test)

# split the data to training and test sets
np.random.seed(9599)
rand_perm =  np.random.permutation(x)
X_train = rand_perm[:m_train][:,np.newaxis]
X_test =  rand_perm[m_train:][:,np.newaxis]
rand_perm =  np.random.permutation(y)
y_train = rand_perm[:m_train][:,np.newaxis]
y_test =  rand_perm[m_train:][:,np.newaxis]

# visualize training and test set
plt.figure("training and test set")
plt.scatter(X_train, y_train, label='training data')
plt.scatter(X_test, y_test, label='test data')
plt.legend(loc="lower right")

# train several polynomial models
regr = LinearRegression()
deg=[1,3,6,9,11]
Poly_predict=np.zeros((len(deg),200))
for i in range(len(deg)):
    poly = PolynomialFeatures(i)
    new_X = poly.fit_transform(X_train)
    regr.fit(new_X, y_train)
    u = np.linspace(0,11.5,200)   # 
    new_U = poly.fit_transform(u[:,np.newaxis])
    Poly_predict[i,:]=regr.predict(new_U).transpose()

# visualize different polynomial models
plt.figure("different polynomial models",figsize=(9,5))
plt.plot(X_train, y_train, 'o', label='training data', markersize=10)
plt.plot(X_test, y_test, 'o', label='test data', markersize=10)
for i,degree in enumerate(deg):
    plt.plot(u, Poly_predict[i], alpha=0.8, lw=2, label='degree={}'.format(degree))
plt.ylim(-1.5,4)
plt.legend(loc="best")


the size of training set is: 11 
the size of test set is: 4
(4, 1)


<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<matplotlib.legend.Legend at 0x1a2559f0b8>

<font color="green">**Notes: **</font>  

We note that the different polynomial models fit our data differently. For instance, the polynomial model with degree 1 (linear model) fits poorly our data. It doesn't explain a lot of variation in the data since it doesn't include enough features. We say that this is an **"underfitting"** or a **"high bias"** problem.  

On the other hand, when using higher order polynomial with degree 11, we note that the model go through all the points in training data. However, it is not a good fit for our data because it introduce a lot of variation and a lot of features and it doesn't generalize well for test data. This is called **"overfitting"** or **"high variance"** problem.  

In order to choose the best polynomial order that fits our data, we will have recourse to model selection techniques.

### Model Selection and Cross Validation
In this part, we will train several polynomial models. Then, we will assess their performance on the training and the test sets. Hence, we will choose  the model with best performance in both training and test set. However, the calculated performance on test data won't be a good estimation of the performance of our model in general case. In fact, our model order (polynomial degree) is fitted to the test data. Thus, it tends to perform better on test data than on a new data. Therefore, we introduce the cross validation data used for tuning model meta-parameters (polynomial degree, classification threshold...). Then, we will use the test data to estimate the performance of our model in general case.  

<font color="blue">**Question 5: **</font>
- Split the original data ("x" and "y" vector) to training, validation and test data with size "m_train", "m_val" and "m_test" respectively.  
**Hint:** You could use [random permutation](https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.permutation.html) function to generate a random permutation of "m" indices for "x" and "y" vectors. Then, you could select the first "m_train" indices to index train data from "x" and "y". Then, you can act similarly for validation and test sets.
- Calculate "train_error" and "val_error" mean squared error on training and validation set for each polynomial model with degree "i" (for loop counter).  
**Hint:** You could use [mean_squared_error](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html) function from sklearn library to evaluate the mean squared error between original "y" and "y_predicted".
- From the train and validation error graph and values, select the polynomial degree "best_poly_deg" that fit the best our data. Then, compare the training, validation and test error of this polynomial model. What do you notice?

In [24]:
from sklearn.metrics.regression import mean_squared_error

# calculate the size of training, validation and test sets
train_ratio = 0.6
val_ratio = 0.2
m_train = int(round(train_ratio*m)) 
m_val = int(round(val_ratio*m)) 
m_test = m-m_train-m_val
print("the size of training set is:",m_train,"\nthe size of validation set is:",m_val,"\nthe size of test set is:",m_test)

# split data to training, validation and test sets
np.random.seed(5190969)
rand_permx = np.random.permutation(x)
rand_permy = np.random.permutation(y)
X_train = rand_permx[:m_train]
y_train = rand_permy[:m_train]
X_val = rand_permx[:m_val]
y_val = rand_permy[:m_val][:,np.newaxis]
X_test = rand_permx[:m_test]
y_test = rand_permy[:m_test][:,np.newaxis]

# visualize training, validation and test sets
plt.figure("training, validation and test sets")
plt.scatter(X_train, y_train, label='training data')
plt.scatter(X_val, y_val, label='validation data')
plt.scatter(X_test, y_test, label='test data')
plt.legend(loc="lower right")

# train and assess several polynomial models
train_error = np.zeros((deg[-1]+1,)) 
val_error = np.zeros((deg[-1]+1,))

for i in range(deg[-1]+1):
    poly = PolynomialFeatures(i)
    new_X_train= poly.fit_transform(X_train[:,np.newaxis])
    new_X_val = poly.fit_transform(X_val[:,np.newaxis])
    regr.fit(new_X_train, y_train[:,np.newaxis])
    y_train_predicted = regr.predict(new_X_train)
    y_val_predicted = regr.predict(new_X_val)
    
    train_error[i] =  mean_squared_error(y_train, y_train_predicted)
    val_error[i] =    mean_squared_error(y_val, y_val_predicted)
    print("degree {} polynomial has a train error: {:.5f} and validation error: {:.4f}".format(i,train_error[i],val_error[i]))

# visualize error of each polynomial model
plt.figure("train and validation error of each polynomial model")  
plt.plot(range(deg[-1]+1),train_error[:deg[-1]+1],label="train error")
plt.plot(range(deg[-1]+1),val_error[:deg[-1]+1],label="validation error")
plt.ylim(0,1)
plt.legend(loc="best")

best_poly_deg = 8
poly = PolynomialFeatures(best_poly_deg)
new_X_train= poly.fit_transform(X_train[:,np.newaxis])
new_X_test = poly.fit_transform(X_test[:,np.newaxis])
regr.fit(new_X_train, y_train[:,np.newaxis])
y_test_predicted = regr.predict(new_X_test)
test_error = mean_squared_error(y_test, y_test_predicted)
print("train error =",train_error[best_poly_deg],"\nvalidation error =",val_error[best_poly_deg],"\ntest error =",test_error)


the size of training set is: 9 
the size of validation set is: 3 
the size of test set is: 3


<IPython.core.display.Javascript object>

degree 0 polynomial has a train error: 0.44388 and validation error: 0.2062
degree 1 polynomial has a train error: 0.33902 and validation error: 0.1687
degree 2 polynomial has a train error: 0.31261 and validation error: 0.2112
degree 3 polynomial has a train error: 0.09263 and validation error: 0.2323
degree 4 polynomial has a train error: 0.09082 and validation error: 0.2214
degree 5 polynomial has a train error: 0.08176 and validation error: 0.2309
degree 6 polynomial has a train error: 0.07815 and validation error: 0.2134
degree 7 polynomial has a train error: 0.04668 and validation error: 0.1399
degree 8 polynomial has a train error: 0.00000 and validation error: 0.0000
degree 9 polynomial has a train error: 0.00000 and validation error: 0.0000
degree 10 polynomial has a train error: 0.00000 and validation error: 0.0000
degree 11 polynomial has a train error: 0.00000 and validation error: 0.0000


<IPython.core.display.Javascript object>

train error = 5.42088836355e-13 
validation error = 3.6162113436e-13 
test error = 3.6162113436e-13


### Regularization

In this section, we will split the "Microchip" dataset into training and test sets. We will train a polynomial logistic classifier with and without regularization. Then, we will compare train and test accuracy in the two cases.  

The regularization helps to avoid the problem of overfitting by reducing or even making null $\theta_j's$ of non significant features. The idea is to include the sum of $\theta_j's$ in cost function in order to reduce them and remove useless features in our models.

The regularized cost function is equal to: $$J_{Reg}(\theta)= J(\theta)+\frac{\lambda}{2m}\times\sum_{j=1}^{n-1} \theta_j^2$$

The gradient of regularized cost function is equal to: 

$$\nabla J_{Reg}(\theta) = \begin{bmatrix}\frac{\partial J(\theta)}{\partial \theta_0}
\\ \frac{\partial J(\theta)}{\partial \theta_1}
\\ \vdots
\\ \frac{\partial J(\theta)}{\partial \theta_{n-1}}
\end{bmatrix}+\frac{\lambda}{m} \begin{bmatrix}0
\\ \theta_1
\\ \vdots
\\ \theta_{n-1}
\end{bmatrix}$$ 
where: $ \left\{\begin{matrix}
\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m}{(h_\theta(x_i) - y)~x_j} ~~for~ j=0\dots n-1 
\\ h_\theta(x_i)=sigmoid(\theta^\top x_i)=\frac{1}{1+e^{-\theta^\top x_i}}
\end{matrix}\right.$
<font color="blue">**Question 6: **</font>
- Load data from "microchip.txt" file (use [loadtxt](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.loadtxt.html) function from numpy library).
- Split "X" and "Y" data into train and test sets with a size of 90% and 10% respectively.  
**Hint: ** You could use [train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function from sklearn library instead of generating random permutation by hand.
- Implement regularized cost and gradient function according to equations above.

In [41]:
from sklearn.model_selection import train_test_split

# load and extract data
microchip_data =  np.loadtxt("microchip.txt",delimiter='\t')
m = microchip_data.shape[0] # number of mocrochips
x_1 = microchip_data[:,0,np.newaxis] # we add np.newaxis in the indexing to obtain an array 
x_2 = microchip_data[:,1,np.newaxis] # with shape (118,1) instead of (118,)
Y = microchip_data[:,2,np.newaxis] # we add np.newaxis in the indexing to obtain an array with shape (118,1) instead of (118,)

# split data to train and test sets
[X_train, X_test], [y_train, y_test] = train_test_split(np.concatenate((x_1, x_2), axis=1), train_size=0.9, test_size=0.1),train_test_split(y, train_size=0.9, test_size=0.1)

# define sigmoid function for logistic regression hypothesis 
def sigmoid(z):
    return np.ones(z.shape)/(1+np.exp(-z))

# define logistic cost function (inspired from max likelihood)
def cost_func(theta):
    J=np.sum(-y_train*np.log(sigmoid(np.dot(X_train,theta[:,np.newaxis]))))-np.sum((1-y_train)*np.log(1-sigmoid(np.dot(X_train,theta[:,np.newaxis]))))
    return J/m_train  

# define the gradient of logistic cost function 
def grad_cost_func(theta):
    g=(1/m_train)*(np.dot(X_train.transpose(),(sigmoid(np.dot(X_train,theta[:,np.newaxis]))-y_train)))  # this is the vectorized implementation
    g.shape=(g.shape[0],)
    return g  

def Reg_cost_func(theta):
    J = cost_func(theta)+(1/(2*m)*np.sum(theta**2))
    return J

def Reg_grad_cost_func(theta):
    g=grad_cost_func(theta)+((1/m)*theta)
    
    return g  

# model meta-parameters
degree=6       # degree of polynomial features
lambda_=0      # regularization coefficient 
print("The degree of the polynomial model is:",degree,"\nThe regularization coefficient \u03BB is:",lambda_," (no regularization)" if (lambda_==0) else "")  #\u03BB is the unicode caractère lambda

# add polynomial features to the train data
x_1_train = X_train[:,0,np.newaxis] 
x_2_train = X_train[:,1,np.newaxis]
y=y_train

X=np.ones((X_train.shape[0],1))   # initialize X array
X = Poly_Features(X,x_1_train,x_2_train,degree)
n = X.shape[1]

# calculate optimal theta
theta0=np.zeros((n,))
theta_opt = fmin_bfgs(Reg_cost_func,theta0,fprime=Reg_grad_cost_func,disp=0) 


The degree of the polynomial model is: 6 
The regularization coefficient λ is: 0  (no regularization)


ValueError: shapes (106,2) and (28,1) not aligned: 2 (dim 1) != 28 (dim 0)

<font color="blue">**Question 7: **</font>
- Calculate the train and test accuracies. 
- In the previous block of code, vary the values of "degree" (between 2-8) and the values of lambda (between 0: no regularization to 100: a lot of regularization. You could try also 0.1, 1, 10... values). What is the effect of regularization and lambda parameters?
- Compare the value of train and test accuracy in the case with regularization (lambda=0.1 or 1) and without (lambda=0).

In [None]:
# calculate train accuracy
y_train_pred =    # ** your code here** 
train_accuracy =  # ** your code here** 
print("The accuracy on the training data is:", train_accuracy,"%")

# add polynomial features to the test data  
x_1_test = X_test[:,0,np.newaxis] 
x_2_test = X_test[:,1,np.newaxis]
X_test_poly=np.ones((X_test.shape[0],1))   # initialize X_test_poly array
X_test_poly = Poly_Features(X_test_poly,x_1_test,x_2_test,degree)

# calculate test accuracy
y_test_pred =   # ** your code here** 
test_accuracy = # ** your code here** 
print("The accuracy on the test data is:", test_accuracy,"%")

# calculate the mesh grid for contour plot
u1=np.linspace(-1,1.5,50)
u2=np.linspace(-1,1.5,50)
u1, u2 = np.meshgrid(u1, u2)
X3=np.ones((*u1.shape,1))
for i in range(1,degree+1):
    for j in range(i+1):
         X3 = np.concatenate((X3,u1[...,np.newaxis]**(i-j)*u2[...,np.newaxis]**j),axis=-1)
Z=np.dot(X3,theta_opt)

# plot descision boundries
plt.figure("Microchip decision boundries",figsize=(9,5))
fail=plt.scatter(x_1_test[y_test==0], x_2_test[y_test==0],  color='red',label='fail')
success=plt.scatter(x_1_test[y_test==1], x_2_test[y_test==1],  color='green',marker='+',s=80,label='success')
fail_train=plt.scatter(x_1_train[y_train==0], x_2_train[y_train==0],  color='gray',label='fail')
success_train=plt.scatter(x_1_train[y_train==1], x_2_train[y_train==1],  color='gray',marker='+',s=80,label='success')
fail_train=plt.scatter(x_1_train[y_train==0], x_2_train[y_train==0],  color='gray',label='fail')
success_train=plt.scatter(x_1_train[y_train==1], x_2_train[y_train==1],  color='gray',marker='+',s=80,label='success')
plt.xlabel('Test 1 score')
plt.ylabel('Test 2 score')
plt.title('Accepted/Rejected Microchip')
ctr = plt.contour(u1, u2, Z,0,colors="blue")
plt.legend([extra,fail,success,fail_train,success_train], ("decision boundries","test data (fail)","test data (success)","train data (fail)","train data (success)"),loc='best')
