# Hidden layers and random projections

Neural networks hold the promise of being able to perform very well in highly complex classification problems. It is frequently claimed that 'deep' neural networks do that by virtue of their large number of hidden layers. Indeed, the number of layers is important.

## The Perceptron

One of the first neural networks, the perceptron, had no hidden layer and was thus a simple linear classifier. This meant that it could only correctly classify linearly separable classes. After this was recognised, neural network research was abandoned for a long time.

## MLPs

Multi-Layer Perceptrons simply added more /non-linear/ layers. This allowed them to separate arbitrary classes. However, one large problem was the complexity of adjusting the parameters of the network. The most commonly used methods employ some sort of stochastic gradient descent.

## Random projections

One can think of each layer in the network as projecting the original data to some other space, with the last layer projecting it to the final labelling space. The idea of random projections is simply to project the data using a /random/ transformation. This avoids having to estimate the right parameters for it. In practice, RPs perform quite well: for large dimension projections, they male classes separable. If the initial features are very high dimensional, then projecting to a smaller dimensionality space randomly has good guarantees according to the Johnson-Lindenstrauss lemma https://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_lemma

## DNNs

However, sometimes we'd like our hidden layers to have more /meaning/ than just be random. This is especially the case in complex architectures where we might want to perform multiple tasks. For example in vehicular vision applications, you'd like to do background, road and object detection, weather recognition simultaneously. Rather than training one network for each task, they can share the first few layers and only use one specialised output layer for each task. The idea is that somehow you would be able to learn a globally useful representation in the first few layers: which is why a lot of research in DNNs is about learning representations.

In [4]:
print(__doc__)
import pandas
from sklearn import datasets, neighbors, linear_model
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.neural_network import MLPClassifier

features = ["Alcohol", "Malic acid", "Ash", "Alcalinity of ash",
    "Magnesium", "Total phenols", "Flavanoids", "Nonflavanoid phenols",
    "Proanthocyanins", "Color intensity", "Hue",
    "OD280/OD315 of diluted wines", "Proline"]


target = 'Class'

df = pandas.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data', names=[target] + features)
train_data_s, test_data_s = train_test_split(df, test_size=0.2)

# Reduce number of features used
features = ["Alcohol", "Malic acid", "Ash", "Alcalinity of ash",
    "Magnesium"]

model = KNeighborsClassifier(n_neighbors=3).fit(train_data_s[features], train_data_s[target])
score = accuracy_score(test_data_s[target], model.predict(test_data_s[features]))

knn = neighbors.KNeighborsClassifier()
knn.fit(train_data_s[features], train_data_s[target])
logistic = linear_model.LogisticRegression()
logistic.fit(train_data_s[features], train_data_s[target])
mlp = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(10, 5, 2), random_state=1)
mlp.fit(train_data_s[features], train_data_s[target])


print ('Training result')
print('KNN score: %f' % knn.score(train_data_s[features], train_data_s[target]))
print('LogisticRegression score: %f' %logistic.score(train_data_s[features], train_data_s[target]))
print("MLP score: %f" % mlp.score(train_data_s[features], train_data_s[target]))


print ('Testing result')
print('KNN score: %f' % knn.score(test_data_s[features], test_data_s[target]))
print('LogisticRegression score: %f' % logistic.score(test_data_s[features], test_data_s[target]))
print('MLP score: %f' % mlp.score(test_data_s[features], test_data_s[target]))



Automatically created module for IPython interactive environment
Training result
KNN score: 0.802817
LogisticRegression score: 0.767606
MLP score: 0.366197
Testing result
KNN score: 0.666667
LogisticRegression score: 0.694444
MLP score: 0.527778


# Random projections

A random projection is simply a function which has been randomly selected. In this particular case, the projection has the form $$f(x) = \tanh(A x) $$ with $A$ being a matrix with Gaussian-distributed entries. Here $x \in R^n$ and $A \in R^{m \times n}$. If $m < n$ then the projection compresses the input. If on the other hand $m >n$, the input is expanded. This means that even very simple models can classify the training data perfectly when $m$ is large enough. 

In [2]:
from sklearn import random_projection
import numpy

n_p_features = 1000
transform = random_projection.GaussianRandomProjection(n_p_features)
X_train_new = numpy.tanh(transform.fit_transform(train_data_s[features]))
X_test_new = numpy.tanh(transform.fit_transform(test_data_s[features]))

knn = neighbors.KNeighborsClassifier()
knn.fit(X_train_new, train_data_s[target])
logistic = linear_model.LogisticRegression()
logistic.fit(X_train_new, train_data_s[target])
mlp = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1)
mlp.fit(X_train_new, train_data_s[target])

print ('Training result')
print('KNN score: %f' % knn.score(X_train_new, train_data_s[target]))
print('LogisticRegression score: %f' %logistic.score(X_train_new, train_data_s[target]))
print("MLP score: %f" % mlp.score(X_train_new, train_data_s[target]))


print ('Testing result')
print('KNN score: %f' % knn.score(X_test_new, test_data_s[target]))
print('LogisticRegression score: %f' % logistic.score(X_test_new, test_data_s[target]))
print('MLP score: %f' % mlp.score(X_test_new, test_data_s[target]))



Training result
KNN score: 0.866197
LogisticRegression score: 0.866197
MLP score: 0.387324
Testing result
KNN score: 0.444444
LogisticRegression score: 0.277778
MLP score: 0.277778




# Examine performance

In this lab, spend some time to gauge the performance of each method, with and without random projections applied. See whether using a bigger and more complex dataset from the UCI repository makes a difference. Can you say that one method is necessarily "better"?