To follow this tutorial, install the latest version of scikit-learn. It can be installed with pip or conda. For complete installation instructions, see the official installation guide: http://scikit-learn.org/stable/install.html

Data:
We'll use scikit-learn's built-in Iris data set. The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other two; the latter are not linearly separable from each other.

Predicted attribute: class of iris plant. 

Attribute Information:

1. sepal length in cm 
2. sepal width in cm 
3. petal length in cm 
4. petal width in cm 
5. Class Labels: 
    - Iris Setosa 
    - Iris Versicolour 
    - Iris Virginica

Let's go ahead and start by getting the data!

In [1]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
iris = load_iris()


This object behaves like a dictionary. It contains a description of the data, the features, and the targets:

In [25]:
iris.keys()

dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names'])

We can view the names of the data attributes by selecting 'feature_names':

In [26]:
iris['feature_names']

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

Let's set up our Data and our Labels:

In [27]:
X = iris['data']
y = iris['target']
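As a quick sanity check on the description above, the following sketch confirms the array shapes and the number of samples per class. The names `X_check`, `y_check`, and `class_counts` are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris

iris_check = load_iris()
X_check = iris_check['data']
y_check = iris_check['target']

# One row per flower, one column per measured attribute
print(X_check.shape, y_check.shape)

# Count how many samples carry each integer label (0, 1, 2)
class_counts = np.bincount(y_check)
for name, count in zip(iris_check['target_names'], class_counts):
    print(name, count)
```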

Let's split our data into training and testing sets. This is easily done with scikit-learn's train_test_split function from model_selection, which splits arrays or matrices into random train and test subsets. By default, 25% of the samples go to the test set.

In [28]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
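The call above shuffles differently on every run, so the outputs shown in this tutorial will not be reproduced exactly. A minimal sketch of a reproducible variant follows, assuming you want a fixed seed and class-balanced subsets. The seed value 0 and the `_s` variable names are arbitrary choices for illustration, not part of the original tutorial.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_s, y_s = load_iris(return_X_y=True)

# random_state fixes the shuffle; stratify keeps the class proportions
# roughly equal in the training and testing subsets.
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X_s, y_s, test_size=0.25, random_state=0, stratify=y_s
)

print(X_train_s.shape, X_test_s.shape)
print(np.bincount(y_train_s), np.bincount(y_test_s))
```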


View the first ten samples of the testing data and their associated class labels:

In [29]:
print('Test data')
print(X_test[:10])
print('Associated class labels')
print(y_test[:10])

Test data
[[6.7 3.1 5.6 2.4]
 [6.3 3.3 6.  2.5]
 [4.8 3.4 1.9 0.2]
 [5.3 3.7 1.5 0.2]
 [6.3 2.7 4.9 1.8]
 [6.9 3.2 5.7 2.3]
 [5.9 3.2 4.8 1.8]
 [6.5 2.8 4.6 1.5]
 [5.8 2.8 5.1 2.4]
 [7.7 2.8 6.7 2. ]]
Associated Class labels
[2 2 0 0 2 2 1 1 2 2]


Before making any predictions, it is good practice to scale the features so that they are all evaluated on a comparable scale.

Because the ranges of raw feature values can vary widely, the objective functions of some machine learning algorithms do not work well without normalization. For example, many classifiers, such as k-nearest neighbours, compute the Euclidean distance between points. If one feature has a much broader range of values than the others, that feature dominates the distance. The features should therefore be normalized so that each one contributes approximately proportionately to the final distance.

Gradient descent (used to train neural networks and many other machine learning models) also tends to converge faster on normalized features.
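To make the distance argument concrete, here is a small sketch on hypothetical data (not the Iris set) in which one column spans thousands and the other spans about one unit. It measures what share of the squared Euclidean distance comes from the wide column, before and after standardization.

```python
import numpy as np

# Hypothetical data: column 0 has a wide range, column 1 a narrow one.
data = np.array([[1000.0, 0.1],
                 [2000.0, 0.9],
                 [1500.0, 0.5],
                 [3000.0, 0.2]])

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return np.sqrt(np.sum((a - b) ** 2))

raw_dist = euclidean(data[0], data[1])

# Standardize each column: subtract its mean, divide by its standard deviation
scaled = (data - data.mean(axis=0)) / data.std(axis=0)
scaled_dist = euclidean(scaled[0], scaled[1])

# Share of the squared distance contributed by column 0
raw_share = (data[0, 0] - data[1, 0]) ** 2 / raw_dist ** 2
scaled_share = (scaled[0, 0] - scaled[1, 0]) ** 2 / scaled_dist ** 2
print(raw_share, scaled_share)
```

On the raw data the wide column accounts for almost all of the distance; after standardization both columns contribute.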

The following script performs feature scaling:

In [30]:
from sklearn.preprocessing import StandardScaler  
scaler = StandardScaler()  
scaler.fit(X_train)

X_train = scaler.transform(X_train)  
X_test = scaler.transform(X_test)  
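Note that the scaler is fitted on the training data only and then applied to both subsets, so no information from the test set leaks into the preprocessing. The sketch below, on synthetic stand-in data, illustrates that StandardScaler's transform amounts to subtracting the training mean and dividing by the training standard deviation. The `demo_` names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
demo_train = rng.normal(loc=5.0, scale=2.0, size=(100, 4))  # synthetic stand-in
demo_test = rng.normal(loc=5.0, scale=2.0, size=(30, 4))

demo_scaler = StandardScaler().fit(demo_train)
demo_train_scaled = demo_scaler.transform(demo_train)
demo_test_scaled = demo_scaler.transform(demo_test)

# Manual equivalent: statistics come from the training data only
manual = (demo_test - demo_train.mean(axis=0)) / demo_train.std(axis=0)
print(np.allclose(demo_test_scaled, manual))

# The scaled training data has zero mean and unit variance per column;
# the scaled test data generally does not, and that is expected.
print(demo_train_scaled.mean(axis=0).round(6))
print(demo_train_scaled.std(axis=0).round(6))
```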

A multi-layer perceptron (MLP) is a supervised learning algorithm that learns a function \begin{equation*} f(\cdot): R^m \rightarrow R^o \end{equation*} by training on a dataset, where m is the number of input dimensions and o is the number of output dimensions. Given a set of features \begin{equation*} X = \{x_1, x_2, \ldots, x_m\} \end{equation*} and a target y, it can learn a non-linear function approximator for either classification or regression. It differs from logistic regression in that there can be one or more non-linear layers, called hidden layers, between the input and output layers. Figure 1 shows a one-hidden-layer MLP with scalar output.
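The function computed by a one-hidden-layer MLP with scalar output can be sketched in a few lines of NumPy. This is an illustration with random, untrained weights, assuming a ReLU hidden activation and an identity output; it is not how scikit-learn implements the model internally, and all names here are hypothetical.

```python
import numpy as np

def relu(z):
    """Element-wise non-linearity applied in the hidden layer."""
    return np.maximum(z, 0.0)

def mlp_forward(x, W1, b1, W2, b2):
    """Map an input of dimension m to an output of dimension o."""
    hidden = relu(x @ W1 + b1)   # non-linear hidden layer
    return hidden @ W2 + b2      # linear output layer

rng = np.random.default_rng(0)
m, h, o = 4, 6, 1                # input, hidden, and output sizes
W1, b1 = rng.normal(size=(m, h)), np.zeros(h)
W2, b2 = rng.normal(size=(h, o)), np.zeros(o)

x = np.array([5.1, 3.5, 1.4, 0.2])   # one sample with m = 4 features
out = mlp_forward(x, W1, b1, W2, b2)
print(out.shape)
```

Without the hidden non-linearity, the two matrix products would collapse into a single linear map, which is the difference from logistic regression noted above.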

The class MLPClassifier implements a multi-layer perceptron (MLP) that is trained using backpropagation. It trains on two arrays: array X of size (n_samples, n_features), which holds the training samples represented as floating point feature vectors; and array y of size (n_samples,), which holds the target values (class labels) for the training samples:

In [31]:
from sklearn.neural_network import MLPClassifier
clf = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(6, 3), random_state=1)

In [32]:
clf.fit(X_train, y_train)       

MLPClassifier(activation='relu', alpha=1e-05, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(6, 3), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=1, shuffle=True,
       solver='lbfgs', tol=0.0001, validation_fraction=0.1, verbose=False,
       warm_start=False)
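After fitting, the learned parameters are available through the `coefs_` and `intercepts_` attributes. The sketch below refits the same configuration on the full, unscaled Iris data purely to show the shapes implied by hidden_layer_sizes=(6, 3): 4 inputs, hidden layers of 6 and 3 units, and 3 output classes. The `demo_` names are hypothetical, and warnings are silenced because the solver may stop at max_iter before converging.

```python
import warnings
from sklearn.datasets import load_iris
from sklearn.neural_network import MLPClassifier

X_demo, y_demo = load_iris(return_X_y=True)
demo_clf = MLPClassifier(solver='lbfgs', alpha=1e-5,
                         hidden_layer_sizes=(6, 3), random_state=1)

with warnings.catch_warnings():
    warnings.simplefilter('ignore')  # a convergence warning is possible
    demo_clf.fit(X_demo, y_demo)

# One weight matrix and one bias vector per layer transition
weight_shapes = [w.shape for w in demo_clf.coefs_]
bias_shapes = [b.shape for b in demo_clf.intercepts_]
print(weight_shapes)
print(bias_shapes)
```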

After fitting (training), the model can predict labels for new samples. 

In [33]:
y_pred = clf.predict(X_test)
print(y_pred)

[2 2 0 0 2 2 2 1 2 2 1 0 2 2 0 0 2 1 0 2 2 1 0 0 0 2 0 2 0 2 2 2 2 1 0 1 0
 2]


We can easily calculate prediction accuracy with scikit-learn's accuracy_score function.

In [34]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

0.8421052631578947
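Accuracy is simply the fraction of predictions that match the true labels. The sketch below shows this on a small set of hypothetical labels and also builds a confusion matrix, which reveals which classes are mistaken for each other. The arrays `y_true_demo` and `y_hat_demo` are made up for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels for seven samples
y_true_demo = np.array([0, 0, 1, 1, 2, 2, 2])
y_hat_demo = np.array([0, 0, 1, 2, 2, 2, 1])

acc = accuracy_score(y_true_demo, y_hat_demo)
manual_acc = np.mean(y_true_demo == y_hat_demo)  # same quantity by hand

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_hat_demo)
print(acc, manual_acc)
print(cm)
```

The diagonal of the confusion matrix counts the correct predictions, so its sum divided by the total number of samples equals the accuracy.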

Currently, MLPClassifier supports only the cross-entropy loss function, which makes probability estimates available through the predict_proba method.
MLP trains using some form of gradient descent, with the gradients computed by backpropagation. For classification, it minimizes the cross-entropy loss, giving a vector of probability estimates P(y|x) per sample x.


Predicting the probabilities of the first ten instances in X_test:

In [35]:
clf.predict_proba(X_test[:10])

array([[3.82757861e-025, 5.48766205e-174, 1.00000000e+000],
       [4.18968128e-025, 3.58282981e-208, 1.00000000e+000],
       [1.00000000e+000, 0.00000000e+000, 1.24582174e-026],
       [1.00000000e+000, 0.00000000e+000, 5.29295035e-033],
       [4.35391346e-017, 6.99803199e-015, 1.00000000e+000],
       [7.81933903e-026, 3.77764061e-185, 1.00000000e+000],
       [6.85534938e-015, 1.84483674e-002, 9.81551633e-001],
       [1.53799226e-047, 1.00000000e+000, 5.07145518e-033],
       [4.60725308e-019, 4.14199267e-076, 1.00000000e+000],
       [5.46374354e-031, 9.71074819e-270, 1.00000000e+000]])
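Each row above sums to one because the network's output scores are passed through a softmax. The following NumPy sketch shows, on hypothetical scores, how softmax turns scores into probabilities and how the cross-entropy loss is computed from them. It is a simplified illustration, not scikit-learn's internal code.

```python
import numpy as np

def softmax(z):
    """Convert each row of scores into probabilities that sum to 1."""
    z = z - z.max(axis=1, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels):
    """Mean negative log-probability assigned to the true class."""
    picked = probs[np.arange(len(labels)), labels]
    return -np.mean(np.log(np.clip(picked, 1e-12, 1.0)))

# Hypothetical output scores for two samples and three classes
scores = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0]])
labels = np.array([0, 2])

probs = softmax(scores)
loss = cross_entropy(probs, labels)
print(probs.sum(axis=1))
print(loss)
```

The loss approaches zero as the probability assigned to the true class approaches one, which is why a confident, correct model produces rows like those printed above.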

MLPClassifier uses the parameter alpha as the strength of an L2 regularization term, which helps avoid overfitting by penalizing weights with large magnitudes. Students should alter the value of alpha and observe how it changes the training process and the prediction accuracy.
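One way to explore the effect of alpha is to refit the same architecture for several values and compare training and held-out accuracy. The sketch below does this; the list of alpha values, the seed, and the variable names are arbitrary choices for illustration, and the numbers it prints depend on the split.

```python
import warnings
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X_a, y_a = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_a, y_a, random_state=0, stratify=y_a)

# Fit the scaler on the training part only, then apply it to both parts
sc = StandardScaler().fit(X_tr)
X_tr, X_te = sc.transform(X_tr), sc.transform(X_te)

results = {}
for alpha in [1e-5, 1e-3, 1e-1, 10.0]:
    model = MLPClassifier(solver='lbfgs', alpha=alpha,
                          hidden_layer_sizes=(6, 3), random_state=1)
    with warnings.catch_warnings():
        warnings.simplefilter('ignore')  # the solver may stop at max_iter
        model.fit(X_tr, y_tr)
    results[alpha] = (model.score(X_tr, y_tr), model.score(X_te, y_te))

for alpha, (train_acc, test_acc) in results.items():
    print(f'alpha={alpha:g}  train={train_acc:.3f}  held-out={test_acc:.3f}')
```

A large gap between training and held-out accuracy suggests overfitting; very large alpha values can instead shrink the weights so much that the model underfits.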

# Lab assignment

Compare the neural network classifier with the k-nearest-neighbour classifier on this dataset. Use cross-validation or a single holdout set to select the best parameters for each model. In particular, keep the network architecture fixed but vary $\alpha$, and keep the kNN distance fixed but vary $k$. Then test your models on an independent test set to see how they perform.
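As a starting point only, the mechanics of cross-validated parameter selection can be sketched as follows for the kNN side. The candidate values of k, the 5-fold setting, the seed, and all variable names are assumptions made for illustration; the same pattern can be adapted to vary $\alpha$ for the network.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_iris(return_X_y=True)

# Set aside an independent test set that is never used for selection
X_dev, X_final, y_dev, y_final = train_test_split(
    X_all, y_all, random_state=0, stratify=y_all
)

cv_means = {}
for k in [1, 3, 5, 7, 9]:
    # Scaling lives inside the pipeline so each fold is scaled on its own training part
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_means[k] = cross_val_score(pipe, X_dev, y_dev, cv=5).mean()

best_k = max(cv_means, key=cv_means.get)

# Refit with the selected k on all development data, then test once
final_model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=best_k))
final_model.fit(X_dev, y_dev)
final_acc = final_model.score(X_final, y_final)
print(best_k, final_acc)
```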

Bonus questions:
1. Is it possible to identify interesting or useful features?
2. Suppose different misclassifications have different costs. Write a decision function that takes as input a model equipped with the method model.predict_proba(), a feature vector $x$, and a real-valued utility matrix $U$, where $U(y,a)$ is the utility of predicting $a$ when $y$ is the correct class. The function should output the class label with maximum expected utility.