# Classification on Wine Dataset

## IMPORTANT: make sure to rerun all the code from the beginning to obtain the results for the final version of your notebook, since this is the way we will do it before evaluating your notebook!!!

### Dataset description

We will be working with a dataset on wines from the UCI machine learning repository
(http://archive.ics.uci.edu/ml/datasets/Wine). It contains data for 178 instances. 
The dataset contains the results of a chemical analysis of wines grown in the same region
in Italy but derived from three different cultivars. The analysis determined the
quantities of 13 constituents found in each of the three types of wines. 

### The features in the dataset are:

- Alcohol
- Malic acid
- Ash
- Alcalinity of ash
- Magnesium
- Total phenols
- Flavanoids
- Nonflavanoid phenols
- Proanthocyanins
- Color intensity
- Hue
- OD280/OD315 of diluted wines
- Proline



We first import all the packages that are needed

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt


import numpy as np
import scipy as sp
from scipy import stats
from sklearn import datasets
from sklearn import linear_model
from sklearn import preprocessing

# Perceptron
We will implement the perceptron and use it to learn a halfspace with 0-1 loss.

**TO DO** Set the random seed to your ID (matricola).

In [2]:
IDnumber = 2056755
np.random.seed(IDnumber)

Load the dataset from scikit-learn and then split it into a training set and a test set (50%-50%) after applying a random permutation to the dataset.

In [3]:
# Load the dataset from scikit learn
wine = datasets.load_wine()

m = wine.data.shape[0]
permutation = np.random.permutation(m)

X = wine.data[permutation]
Y = wine.target[permutation]

We are going to classify class "1" vs the other two classes (0 and 2). We are going to relabel the other classes (0 and 2) as "-1" so that we can use it directly with the perceptron.

In [4]:
print(X)
print(Y)

[[1.386e+01 1.350e+00 2.270e+00 ... 1.010e+00 3.550e+00 1.045e+03]
 [1.351e+01 1.800e+00 2.650e+00 ... 1.100e+00 2.870e+00 1.095e+03]
 [1.377e+01 1.900e+00 2.680e+00 ... 1.130e+00 2.930e+00 1.375e+03]
 ...
 [1.416e+01 2.510e+00 2.480e+00 ... 6.200e-01 1.710e+00 6.600e+02]
 [1.305e+01 3.860e+00 2.320e+00 ... 8.400e-01 2.010e+00 5.150e+02]
 [1.293e+01 3.800e+00 2.650e+00 ... 1.030e+00 3.520e+00 7.700e+02]]
[0 0 0 1 0 2 2 2 1 0 2 2 1 0 0 2 0 2 1 2 1 2 0 1 0 0 2 1 2 0 0 0 2 2 1 1 1
 1 1 0 1 0 1 0 0 1 0 2 2 2 1 1 2 2 1 1 0 2 1 0 0 0 2 1 2 1 0 0 2 1 0 0 1 1
 1 0 2 0 1 0 2 0 1 2 1 0 2 2 2 1 0 0 1 1 0 1 1 2 1 1 2 2 0 1 1 1 0 1 0 0 0
 0 2 0 2 0 0 0 0 0 2 2 2 2 1 0 1 2 1 1 2 1 1 1 1 1 1 1 2 1 2 2 1 1 1 1 1 0
 2 1 0 1 0 0 1 1 1 2 2 2 0 1 1 0 2 0 1 0 0 0 1 1 1 1 1 2 1 0]


In [5]:
#let's relabel classes 0 and 2 as -1
for i in range(len(Y)):
    if Y[i] != 1:
        Y[i] = -1

**TO DO** Divide the data into training set and test set (50% of the data each). **Note**: we do not normalize the features since it is not needed for this dataset and task.

In [6]:
#Divide in training and test: make sure that your training set
#contains at least 10 elements from class 1 and at least 10 elements
#from class -1! If it does not, modify the code so to apply more random
#permutations (or the same permutation multiple times) until this happens.

#m_training needs to be the number of samples in the training set
m_training = round(len(X) / 2)

#m_test needs to be the number of samples in the test set
#(the remainder, so that the two sets always cover the whole dataset)
m_test = len(X) - m_training

#X_training = instances for training set
X_training = X[:m_training]
#Y_training = labels for the training set
Y_training = Y[:m_training]

#X_test = instances for test set
X_test = X[m_training:]
#Y_test = labels for the test set
Y_test = Y[m_training:]

print(Y_training) #to make sure that Y_training contains both 1 and -1

[-1 -1 -1  1 -1 -1 -1 -1  1 -1 -1 -1  1 -1 -1 -1 -1 -1  1 -1  1 -1 -1  1
 -1 -1 -1  1 -1 -1 -1 -1 -1 -1  1  1  1  1  1 -1  1 -1  1 -1 -1  1 -1 -1
 -1 -1  1  1 -1 -1  1  1 -1 -1  1 -1 -1 -1 -1  1 -1  1 -1 -1 -1  1 -1 -1
  1  1  1 -1 -1 -1  1 -1 -1 -1  1 -1  1 -1 -1 -1 -1]
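The requirement of at least 10 training samples per class can also be checked programmatically instead of by reading the printed labels. A minimal sketch, assuming labels in {-1, 1}; `has_min_per_class` is a hypothetical helper name that is not part of the notebook, and the toy labels are made up for illustration:

```python
import numpy as np

def has_min_per_class(labels, classes=(-1, 1), min_count=10):
    """Return True if every class in `classes` appears at least min_count times."""
    labels = np.asarray(labels)
    return all(np.sum(labels == c) >= min_count for c in classes)

# Toy labels (illustrative only, not the wine data)
toy_labels = np.array([1] * 12 + [-1] * 15)
print(has_min_per_class(toy_labels))
```

In the notebook the same call would be applied to `Y_training`, drawing a new permutation whenever it returns `False`.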


**TO DO** Now add a 1 in front of each sample so that we can use a vector to describe all the coefficients of the model. You can use the function $hstack$ in $numpy$

In [7]:
#add a 1 to each sample
X_training = np.hstack((np.ones((m_training, 1)), X_training))
X_test = np.hstack((np.ones((m_test, 1)), X_test))
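Prepending a constant 1 lets the first coefficient of the weight vector act as the bias term, so a single dot product already includes the offset. A small illustration on made-up numbers (the values of `X_toy` and `w` are arbitrary assumptions):

```python
import numpy as np

X_toy = np.array([[2.0, 3.0],
                  [4.0, 5.0]])
# Prepend a column of ones, as done for X_training and X_test
X_aug = np.hstack((np.ones((X_toy.shape[0], 1)), X_toy))

w = np.array([0.5, 1.0, -1.0])  # w[0] is the bias, w[1:] the feature weights

# The two expressions agree: augmented dot product vs. explicit bias
with_ones = X_aug @ w
explicit = X_toy @ w[1:] + w[0]
print(with_ones, explicit)
```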

**TO DO** Now complete the function *perceptron*. Since the perceptron does not terminate if the data is not linearly separable, your implementation should return the desired output (see below) if it reached the termination condition seen in class or if a maximum number of iterations have already been run, where 1 iteration corresponds to 1 update of the perceptron weights. If the perceptron returns because the maximum number of iterations has been reached, you should return an appropriate model. 

The input parameters to pass are:
- $X$: the matrix of input features, one row for each sample
- $Y$: the vector of labels for the input features matrix X
- $max\_num\_iterations$: the maximum number of iterations for running the perceptron

The output values are:
- $best\_w$: the vector with the coefficients of the best model
- $best\_error$: the *fraction* of misclassified samples for the best model

In [8]:
print(np.shape(X_training))
print(np.shape(X_test))

(89, 14)
(89, 14)


In [9]:
#Function to perform the perceptron algorithm
def perceptron(X, Y, max_num_iterations):
    #number of samples, taken from X instead of the global m_training
    m = X.shape[0]
    #initialize w as a vector of zeros, one coefficient per column of X
    w = np.zeros(X.shape[1])
    #NOTE: here one iteration is a full pass over the samples,
    #which may perform several weight updates
    for i in range(max_num_iterations):
        #number of updates (mistakes) in this pass
        miss = 0
        for k in range(m):
            #if sample k is misclassified, count it and update w
            if Y[k] * np.dot(X[k], w) <= 0:
                miss += 1
                w = w + Y[k] * X[k]
        #the model returned is the last w computed
        best_w = w
        #NOTE: this is the fraction of updates in the pass, counted while w
        #was still changing; it is not re-evaluated on the final w
        new_best_error = miss / m
        #if no mistake was found, w separates the training set
        if miss == 0:
            break

    return best_w, new_best_error

#print the number of errors on the test set and the fraction over all test samples
def perceptron_evaluation(m_test, Y_test, X_test, w_found, iteration):
    num_errors = 0
    for i in range(m_test):
        if Y_test[i] * np.dot(X_test[i], w_found) <= 0:
            num_errors += 1

    print("Estimated true loss with " + str(iteration) + " iterations: " + str(round(num_errors / m_test, 7)) +
          " number of errors over num of test sample: " + str(num_errors) + "/" + str(m_test))

Now we use the implementation above of the perceptron to learn a model from the training data using 100 iterations and print the error of the best model we have found.

In [10]:
#now run the perceptron for 100 iterations
w_found, training_error = perceptron(X_training,Y_training, 100)
print("Training error with 100 iterations: " + str(round(training_error,7)))

Training error with 100 iterations: 0.258427
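The `perceptron` function above counts a full pass over the data as one iteration and returns the last `w`. The task statement instead defines one iteration as one weight update and asks for the best model found. A minimal sketch of that variant follows; it is not the implementation used for the results in this notebook. The name `perceptron_pocket`, the `rng` parameter, the random choice of which mistake to correct, and the toy data are all assumptions made for illustration:

```python
import numpy as np

def perceptron_pocket(X, Y, max_num_iterations, rng=None):
    """Perceptron where one iteration is one weight update; keeps the best w seen."""
    rng = np.random.default_rng() if rng is None else rng
    w = np.zeros(X.shape[1])
    best_w = w.copy()
    best_error = np.mean(Y * (X @ w) <= 0)  # error of the initial w
    for _ in range(max_num_iterations):
        wrong = np.flatnonzero(Y * (X @ w) <= 0)
        if wrong.size == 0:
            break  # no misclassified sample: w separates the training set
        i = rng.choice(wrong)  # assumption: correct a randomly chosen mistake
        w = w + Y[i] * X[i]
        # Re-evaluate the updated w on all samples and keep it if it is better
        error = np.mean(Y * (X @ w) <= 0)
        if error < best_error:
            best_w, best_error = w.copy(), error
    return best_w, best_error

# Toy linearly separable data with a leading column of ones (illustrative only)
X_toy = np.array([[1.0, 2.0, 1.0], [1.0, 1.5, 2.0],
                  [1.0, -1.0, -1.5], [1.0, -2.0, -1.0]])
Y_toy = np.array([1, 1, -1, -1])
w_toy, err_toy = perceptron_pocket(X_toy, Y_toy, 100, rng=np.random.default_rng(0))
print(w_toy, err_toy)
```

Because the error is recomputed on the whole training set after every update, the returned fraction always refers to the returned vector, at the cost of one extra pass over the data per update.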


**TO DO** Use the best model $w\_found$ to predict the labels for the test dataset and print the fraction of misclassified samples in the test set (that is, an estimate of the true loss).

In [11]:
#now use the w_found to make predictions on test dataset

#NOTE: you can avoid using num_errors if you prefer, as long as true_loss_estimate is correct
perceptron_evaluation(m_test, Y_test, X_test, w_found, 100)

Estimated true loss with 100 iterations: 0.4157303 number of errors over num of test sample: 37/89


**TO DO**: [Answer the following] what relation do you observe between the training error and the (estimated) true loss? Is this what you expected? Explain what you observe and why it does or does not conform to your expectations. [Write the answer in this cell]

**ANSWER**: This is what I expected from the Perceptron algorithm. Informally, the Perceptron builds a sequence of vectors w, starting from the all-zeros vector. At each step it looks for an example that is misclassified, that is, one for which the sign of the dot product of w and X disagrees with the label Y, and it updates w by adding the instance X scaled by the label Y.
I kept the vector w_found learned on the training data and used it to estimate the true loss on the test data. Since the algorithm updates w only to fit the training data, it is normal for the estimated true loss (0.416) to be higher than the training error (0.258).
In conclusion, when we evaluate w_found on the test set we observe a large number of misclassified examples, 37/89, so this w_found is not a good choice for the test data.
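
The effect of a single update can be made explicit. A short derivation, writing $(x_i, y_i)$ for the misclassified example and $w^{(t)}$ for the current vector (this notation is assumed here, with $y_i \in \{-1, 1\}$ so that $y_i^2 = 1$):

$$
y_i \langle w^{(t+1)}, x_i \rangle = y_i \langle w^{(t)} + y_i x_i, x_i \rangle = y_i \langle w^{(t)}, x_i \rangle + y_i^2 \|x_i\|^2 = y_i \langle w^{(t)}, x_i \rangle + \|x_i\|^2
$$

Each update therefore moves $y_i \langle w, x_i \rangle$ toward the positive (correct) side for the example that triggered it, but it gives no guarantee for the other training examples or for unseen test examples.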

**TO DO** Copy the code from the last 2 cells above in the cell below and repeat the training with 10000 iterations. 

In [12]:
#now run the perceptron for 10000 iterations here!
w_found_1, training_error_1 = perceptron(X_training,Y_training, 10000)
#training_error = error on the training set
print("Training error with 10000 iterations: " + str(round(training_error_1,7)))
#NOTE: you can avoid using num_errors if you prefer, as long as true_loss_estimate is correct
perceptron_evaluation(m_test, Y_test, X_test, w_found_1, 10000)

Training error with 10000 iterations: 0.1235955
Estimated true loss with 10000 iterations: 0.0786517 number of errors over num of test sample: 7/89


**TO DO** [Answer the following] What changes in the training error and in the test error (in terms of fraction of misclassified samples)? Explain what you observe. [Write the answer in this cell]

**ANSWER**: As expected, the perceptron performs better with 10000 iterations than with 100. The training error drops from 0.26 with 100 iterations to 0.12 with 10000 iterations, a clear improvement of 0.14. However, the estimated true loss on the test data (0.08) is lower than the training error, possibly because the w we found happens to fit the test data particularly well. Note also that the training error printed by the code is counted during the last pass, while w is still being updated, so it is not exactly the error of the final w. More precisely, by running some tests on the perceptron, we can notice that if we keep increasing the number of iterations the training error keeps going down until it reaches 0. On the other hand, if we estimate the true loss for each w_found obtained this way, it does not go below 0.08.
NB: the code cell below runs the test that supports the explanation above.

In [None]:
#now run the perceptron for 100000 iterations here!
w_found_2, training_error_2 = perceptron(X_training,Y_training, 100000)
#training_error = error on the training set
print("Training error with 100000 iterations: " + str(round(training_error_2 , 7)))
#NOTE: you can avoid using num_errors if you prefer, as long as true_loss_estimate is correct
perceptron_evaluation(m_test, Y_test, X_test, w_found_2, 100000)

# Logistic Regression
Now we use logistic regression, as implemented in Scikit-learn, to predict labels. We first do it for 2 labels and then for 3 labels. We will also plot the decision region of logistic regression.

We first load the dataset again.

In [None]:
# Load the dataset from scikit learn
wine = datasets.load_wine()

m = wine.data.shape[0]
permutation = np.random.permutation(m)

X = wine.data[permutation]
Y = wine.target[permutation]

**TO DO** As for the previous part, divide the data into training and test (50%-50%), relabel classes 0 and 2 as -1. Here there is no need to add a 1 at the beginning of each row, since it will be done automatically by the function we will use.

In [None]:
#Divide in training and test: make sure that your training set
#contains at least 10 elements from class 1 and at least 10 elements
#from class -1! If it does not, modify the code so to apply more random
#permutations (or the same permutation multiple times) until this happens.
#IMPORTANT: do not change the random seed.

m_training = round(len(X) / 2)
m_test = len(X) - m_training

#let's relabel classes 0 and 2 as -1 (before splitting, so that the
#training and test labels are relabeled regardless of how they are sliced)
for i in range(len(Y)):
    if Y[i] != 1:
        Y[i] = -1

X_training = X[:m_training]
Y_training = Y[:m_training]
X_test = X[m_training:]
Y_test = Y[m_training:]

To define a logistic regression model in Scikit-learn use the instruction

$linear\_model.LogisticRegression(C=1e5)$

($C$ is a parameter related to *regularization*, a technique that
we will see later in the course. Setting it to a high value is almost
equivalent to ignoring regularization, so the instruction above corresponds
to the logistic regression you have seen in class.)
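
In scikit-learn, $C$ is the inverse of the regularization strength, so a larger $C$ leaves the coefficients freer to grow. A minimal sketch that compares the coefficient norm for a few values of $C$ on the class 1 vs. rest task; the chosen values of `C`, `max_iter=10000`, and scaling the whole dataset at once are assumptions made only for this illustration:

```python
import numpy as np
from sklearn import datasets, linear_model, preprocessing

wine = datasets.load_wine()
X = preprocessing.StandardScaler().fit_transform(wine.data)
y = np.where(wine.target == 1, 1, -1)  # class 1 vs. the other two classes

norms = {}
for C in (1e-2, 1.0, 1e5):
    model = linear_model.LogisticRegression(C=C, max_iter=10000).fit(X, y)
    # Euclidean norm of the learned coefficients (intercept excluded)
    norms[C] = np.linalg.norm(model.coef_)
    print("C =", C, " ||w|| =", norms[C])
```

The norm grows with $C$: with strong regularization (small $C$) the coefficients are shrunk toward zero, while with $C = 10^5$ the penalty is almost negligible.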

To learn the model you need to use the $fit(...)$ instruction and to predict you need to use the $predict(...)$ function. See the Scikit-learn documentation for how to use it.

**TO DO** Define the logistic regression model, then learn the model using the training set and predict on the test set. Then print the fraction of samples misclassified in the training set and in the test set.

In [None]:
#helper to count the misclassified samples
#defined as a function to keep the code below tidy
def miss_fraction(a, b, name_error):
    miss = 0
    for i in range(len(b)):
        if a[i] != b[i]:
            miss += 1
    print("error rate of " + str(name_error) + " set: " + str(miss) + "/" + str(len(b)))
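
The same quantity can be computed without an explicit loop and returned as a number rather than printed. A minimal sketch on made-up labels; `miss_fraction_value` is a hypothetical name, not part of the notebook:

```python
import numpy as np

def miss_fraction_value(predicted, actual):
    """Fraction of positions where predicted and actual labels differ."""
    predicted = np.asarray(predicted)
    actual = np.asarray(actual)
    return np.mean(predicted != actual)

# Toy labels (illustrative only): the two vectors differ in 2 of 5 positions
y_true = np.array([1, -1, -1, 1, 1])
y_pred = np.array([1, -1, 1, 1, -1])
print(miss_fraction_value(y_pred, y_true))
```

For a scikit-learn classifier this fraction equals one minus the mean accuracy returned by `score(...)` on the same data.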

In [None]:
#part on logistic regression for 2 classes   
logreg1 = linear_model.LogisticRegression(C = 1e5)

#preprocessing the data
scaler = preprocessing.StandardScaler().fit(X_training)
X_training_scaled = scaler.transform(X_training)
X_test_scaled = scaler.transform(X_test)

#learn from training set
regression_model = logreg1.fit(X_training_scaled, Y_training)

#predict on training set
pred_training = regression_model.predict(X_training_scaled)
miss_fraction(pred_training, Y_training, "training")

#predict on test set
pred_test = regression_model.predict(X_test_scaled)
miss_fraction(pred_test, Y_test, "test")
print("Score on the data: "+str(regression_model.score(X_test_scaled, Y_test)))

Now we do logistic regression for classification with 3 classes.

**TO DO** First: let's load the data once again (with the same permutation from before).

In [None]:
#part on logistic regression for 3 classes

#Divide in training and test: make sure that your training set
#contains at least 10 elements from each of the 3 classes!
#If it does not, modify the code so to apply more random
#permutations (or the same permutation multiple times) until this happens.
#IMPORTANT: do not change the random seed.
X = wine.data[permutation]
Y = wine.target[permutation]

#check the current permutation and draw a new one only if some class has
#fewer than 10 examples in the training half; with this dataset that is
#very unlikely, but the function handles the case
def ten_element_012(X, Y, m_training, min_count=10):
    while True:
        #count the training examples of classes 0, 1 and 2
        counts = np.bincount(Y[:m_training], minlength=3)
        for label in range(3):
            print("number of " + str(label) + ": " + str(counts[label]))
        if counts.min() >= min_count:
            #return exactly the data whose counts were just checked
            return X, Y
        print("fail")
        permutation = np.random.permutation(len(Y))
        X = wine.data[permutation]
        Y = wine.target[permutation]

X, Y = ten_element_012(X, Y, m_training)
    
X_training = X[:m_training]
Y_training = Y[:m_training]
X_test = X[m_training:]
Y_test = Y[m_training:]

print(Y_test)
print(Y_training)
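
Outside the constraints of this assignment (which asks for random permutations with a fixed seed), a different technique is a stratified split, which keeps the class proportions in both halves by construction. A minimal sketch with scikit-learn's `train_test_split`; the `random_state=0` seed and the variable names are arbitrary choices for this illustration:

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

wine = datasets.load_wine()
X_tr, X_te, Y_tr, Y_te = train_test_split(
    wine.data, wine.target,
    test_size=0.5,
    stratify=wine.target,  # keep class proportions in both halves
    random_state=0,        # arbitrary seed chosen for this sketch
)
# Number of training samples for classes 0, 1 and 2
train_counts = np.bincount(Y_tr, minlength=3)
print(train_counts)
```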

**TO DO** Now perform logistic regression (instructions as before) for 3 classes, learning a model from the training set and predicting on the test set. Print the fraction of misclassified samples on the training set and the fraction of misclassified samples on the test set.

In [None]:
#define logistic regression model
#NOTE: setting multi_class = 'ovr' or 'multinomial' gives better results,
#but the TO DO does not ask for it and specifies "instructions as before"
logreg2 = linear_model.LogisticRegression(C = 1e5 )

#preprocessing the data
scaler = preprocessing.StandardScaler().fit(X_training)
X_training_scaled = scaler.transform(X_training)
X_test_scaled = scaler.transform(X_test)
print(X_training_scaled[:1,:])

#learn from training set
regression_model = logreg2.fit(X_training_scaled, Y_training)

#predict on training set
pred_training = regression_model.predict(X_training_scaled)
miss_fraction(pred_training, Y_training, "training")

#predict on test set
pred_test = regression_model.predict(X_test_scaled)
miss_fraction(pred_test, Y_test, "test")
print("Score on the data: "+str(regression_model.score(X_test_scaled, Y_test)))
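
With three classes, the fitted model assigns a probability to each class and `predict` returns the most probable one. A minimal sketch showing that relation on the first few samples; fitting on the whole standardized dataset and `max_iter=10000` are assumptions made only to keep the example short:

```python
import numpy as np
from sklearn import datasets, linear_model, preprocessing

wine = datasets.load_wine()
X = preprocessing.StandardScaler().fit_transform(wine.data)
y = wine.target

model = linear_model.LogisticRegression(C=1e5, max_iter=10000).fit(X, y)

proba = model.predict_proba(X[:5])  # shape (5, 3): one column per class
pred = model.predict(X[:5])
# The predicted label is the class whose probability is largest
from_proba = model.classes_[np.argmax(proba, axis=1)]
print(proba.shape, pred, from_proba)
```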

**TO DO** Now pick two features and restrict the dataset to include only two features, whose indices are specified in the $features$ vector below. Then split into training and test.

In [None]:
#to make the plot we need to reduce the data to 2D, so we choose two features

features_list = ['Alcohol',
'Malic acid',
'Ash',
'Alcalinity of ash',
'Magnesium',
'Total phenols',
'Flavanoids',
'Nonflavanoid phenols',
'Proanthocyanins',
'Color intensity',
'Hue',
'OD280/OD315 of diluted wines',
'Proline']
labels_list = ['class_0', 'class_1', 'class_2']

index_feature1 = 0
index_feature2 = 1
features = [index_feature1, index_feature2]

#look up the names directly by feature index, so that this also works
#when the chosen indices are not 0 and 1
feature_name0 = features_list[index_feature1]
feature_name1 = features_list[index_feature2]

#X_red is X reduced to include only the 2 features of
#indices index_feature1 and index_feature2
X_red = X[:,features]

len1 = round(len(X)/2)

X_red_training = X_red[:len1]
Y_training = Y[:len1]
print(np.shape(X_red_training))
X_red_test = X_red[len1:]
Y_test = Y[len1:]

Now learn a model using the training data.

In [None]:
logreg3 = linear_model.LogisticRegression(C = 1e5)

#preprocessing the data
scaler = preprocessing.StandardScaler().fit(X_red_training)
X_training_scaled = scaler.transform(X_red_training)
X_test_scaled = scaler.transform(X_red_test)

#learn from training set
regression_model = logreg3.fit(X_training_scaled, Y_training)
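
Because the model is fitted on standardized features, every point passed to `predict` must go through the same scaler, including the points of a plotting grid. One way to make that automatic is to wrap the scaler and the classifier in a pipeline. A minimal sketch, not the code used in this notebook; it fits on the whole two-feature dataset, and the grid sizes and `max_iter=10000` are arbitrary choices for illustration:

```python
import numpy as np
from sklearn import datasets, linear_model, preprocessing
from sklearn.pipeline import make_pipeline

wine = datasets.load_wine()
X2 = wine.data[:, [0, 1]]  # Alcohol and Malic acid
y = wine.target

pipe = make_pipeline(preprocessing.StandardScaler(),
                     linear_model.LogisticRegression(C=1e5, max_iter=10000))
pipe.fit(X2, y)

# Grid in the original feature units; the pipeline scales it internally
xx, yy = np.meshgrid(np.linspace(X2[:, 0].min(), X2[:, 0].max(), 50),
                     np.linspace(X2[:, 1].min(), X2[:, 1].max(), 40))
Z = pipe.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
print(Z.shape)
```

The resulting `Z` can be passed to `plt.pcolormesh(xx, yy, Z)` together with the unscaled points, since the grid and the points share the same units.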

If everything is ok, the code below uses the model in $logreg3$ to plot the decision region for the two features chosen above, with colors denoting the predicted value. It also plots the points (with correct labels) in the training set. It makes a similar plot for the test set.

In [None]:
# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
h = .02  # step size in the mesh
x_min, x_max = X_red[:, 0].min() - .5, X_red[:, 0].max() + .5
y_min, y_max = X_red[:, 1].min() - .5, X_red[:, 1].max() + .5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

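# The model was trained on standardized features, so the mesh points
# (in original units) must go through the same scaler before predicting
Z = logreg3.predict(scaler.transform(np.c_[xx.ravel(), yy.ravel()]))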

# Put the result into a color plot
Z = Z.reshape(xx.shape)

plt.figure(1, figsize=(5, 4))
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)

# Plot also the training points
plt.scatter(X_red_training[:, 0], X_red_training[:, 1], c=Y_training, edgecolors='k', cmap=plt.cm.Paired)
plt.xlabel(feature_name0)
plt.ylabel(feature_name1)

plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.xticks(())
plt.yticks(())
plt.title('Training set')

plt.show()

# Reuse the same decision regions (Z already has the shape of the mesh)
plt.figure(1, figsize=(5, 4))
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)

# Plot also the test points 
plt.scatter(X_red_test[:, 0], X_red_test[:, 1], c=Y_test, edgecolors='k', cmap=plt.cm.Paired, marker='s')
plt.xlabel(feature_name0)
plt.ylabel(feature_name1)

plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.xticks(())
plt.yticks(())
plt.title('Test set')

plt.show()