# [Artificial Neural Networks](https://en.wikipedia.org/wiki/Artificial_neural_network)

- Artificial neural networks are computing systems vaguely inspired by the human brain.
- The field began with McCulloch and Pitts (1943), who proposed a computational model of neural networks.
- A network consists of neurons that are interconnected like a web.
- Each connection, like a synapse in a brain, can transmit a signal (a real number) to other neurons.
- Main types of neural networks:
  + [multilayer perceptron](https://en.wikipedia.org/wiki/Multilayer_perceptron) (old school but still useful)
  + [autoencoder](https://en.wikipedia.org/wiki/Autoencoder) (for dimension reduction and visualization)
  + [convolutional network](https://en.wikipedia.org/wiki/Convolutional_neural_network) (originally developed for image classification)
  + [recurrent network](https://en.wikipedia.org/wiki/Recurrent_neural_network) (designed for sequential data such as text; example: [LSTM](https://en.wikipedia.org/wiki/Long_short-term_memory))
  + [transformer](https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)) (originally developed for machine translation)
  + competitive network (example: [GAN](https://en.wikipedia.org/wiki/Generative_adversarial_network))
  + ...

## Theory of the [Multilayer Perceptron](https://en.wikipedia.org/wiki/Multilayer_perceptron) in a Nutshell

<img src="../_img/mlp.jpg" width="320px">

- input: $x \in \mathbb{R}^{d \times 1}$<br>
- hidden layer: $h = \sigma(W^T x)$, where $W \in \mathbb{R}^{d \times K}$ and $\sigma$ is the [logistic sigmoid function](https://en.wikipedia.org/wiki/Logistic_function)<br>
- model output: $\hat{y} = \sigma(v^T h)$, where $v \in \mathbb{R}^{K \times 1}$<br>
- the parameters of the model are the matrix $W$ (hidden weights) and the vector $v$ (output weights)
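
The forward pass defined above can be sketched in NumPy as follows. The sizes `d = 4` and `K = 3` and the random weights are arbitrary assumptions chosen only for illustration.

```python
import numpy as np

def sigmoid(t):
    """Logistic sigmoid, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-t))

d, K = 4, 3                          # illustrative sizes (assumed)
rng = np.random.default_rng(0)
x = rng.normal(size=d)               # input vector
W = rng.uniform(-0.1, 0.1, (d, K))   # hidden weights
v = rng.uniform(-0.1, 0.1, K)        # output weights

h = sigmoid(W.T @ x)                 # hidden layer, shape (K,)
y_hat = sigmoid(v @ h)               # model output, a scalar in (0, 1)
print(h.shape, float(y_hat))
```

Because the output passes through a sigmoid, `y_hat` can be read as the estimated probability of the positive class.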

<hr>

- objective function: $CE(W, v) = \sum_{i=1}^n \left( -y_i\log(\hat{y}_i) - (1 - y_i)\log(1 - \hat{y}_i) \right)$<br>
- derivative by $v$: $\frac{d}{dv} CE(W, v) = \sum_{i=1}^n(\hat{y}_i - y_i) h_i$<br>
- derivative by $W$: $\frac{d}{dW} CE(W, v) = \sum_{i=1}^n x_i \varepsilon_i^T$, where $\varepsilon_i = (\hat{y}_i - y_i) v \odot h_i \odot(1 - h_i)$ is the backpropagated error
- the approximate minimization of $CE$ can be done e.g. by [stochastic gradient descent](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
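
One way to gain confidence in the two derivative formulas above is to compare them with central finite differences. The sketch below does this on a small random problem; the sizes, the random data, and the helper names `cross_entropy` and `gradients` are assumptions introduced for this check only.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def cross_entropy(W, v, X, y):
    """Summed cross-entropy of the one-hidden-layer model."""
    H = sigmoid(X @ W)                    # hidden activations, shape (n, K)
    y_hat = sigmoid(H @ v)                # model outputs, shape (n,)
    return np.sum(-y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat))

def gradients(W, v, X, y):
    """Analytic gradients, following the formulas in the list above."""
    H = sigmoid(X @ W)
    err = sigmoid(H @ v) - y              # yhat_i - y_i, shape (n,)
    grad_v = H.T @ err                    # sum_i (yhat_i - y_i) h_i
    eps = err[:, None] * v * H * (1 - H)  # backpropagated errors, shape (n, K)
    grad_W = X.T @ eps                    # sum_i x_i eps_i^T
    return grad_W, grad_v

rng = np.random.default_rng(0)
n, d, K = 5, 4, 3                         # illustrative sizes (assumed)
X = rng.normal(size=(n, d))
y = rng.integers(0, 2, size=n).astype(float)
W = rng.uniform(-0.5, 0.5, (d, K))
v = rng.uniform(-0.5, 0.5, K)

grad_W, grad_v = gradients(W, v, X, y)

# central finite differences, one coordinate at a time
step = 1e-6
num_v = np.zeros_like(v)
for k in range(K):
    e = np.zeros_like(v)
    e[k] = step
    num_v[k] = (cross_entropy(W, v + e, X, y) - cross_entropy(W, v - e, X, y)) / (2 * step)

num_W = np.zeros_like(W)
for j in range(d):
    for k in range(K):
        E = np.zeros_like(W)
        E[j, k] = step
        num_W[j, k] = (cross_entropy(W + E, v, X, y) - cross_entropy(W - E, v, X, y)) / (2 * step)

# largest absolute deviation between analytic and numerical gradients
print(np.max(np.abs(grad_v - num_v)), np.max(np.abs(grad_W - num_W)))
```

Both printed deviations should be tiny compared with the gradient entries themselves if the formulas are implemented consistently.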


## The Phishing Websites Problem

The [Phishing Websites](https://archive.ics.uci.edu/ml/machine-learning-databases/00327/Training%20Dataset.arff) data set describes websites by a set of attributes. The target attribute, stored in the last column, specifies whether the site is legitimate (-1) or phishing (+1). Our goal is to build an artificial neural network that predicts the value of the target attribute.

**Exercise 1**: Load the Phishing Websites data set to a data frame. Prepare the input matrix and the target vector.

In [1]:
# Load data.
import pandas as pd
from urllib.request import urlopen

url = 'https://archive.ics.uci.edu/'\
      'ml/machine-learning-databases/00327/Training%20Dataset.arff'
lines = urlopen(url).read().decode('utf-8').split('\r\n')
names = [l.split()[1] for l in lines if l.startswith('@att')]
skiprows = lines.index('@data') + 1
df = pd.read_csv(url, names=names, skiprows=skiprows)

In [2]:
df

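| | having_IP_Address | URL_Length | Shortining_Service | having_At_Symbol | double_slash_redirecting | Prefix_Suffix | having_Sub_Domain | SSLfinal_State | Domain_registeration_length | Favicon | ... | popUpWidnow | Iframe | age_of_domain | DNSRecord | web_traffic | Page_Rank | Google_Index | Links_pointing_to_page | Statistical_report | Result |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1 | 1 | 1 | 1 | -1 | -1 | -1 | -1 | -1 | 1 | ... | 1 | 1 | -1 | -1 | -1 | -1 | 1 | 1 | -1 | -1 |
| 1 | 1 | 1 | 1 | 1 | 1 | -1 | 0 | 1 | -1 | 1 | ... | 1 | 1 | -1 | -1 | 0 | -1 | 1 | 1 | 1 | -1 |
| 2 | 1 | 0 | 1 | 1 | 1 | -1 | -1 | -1 | -1 | 1 | ... | 1 | 1 | 1 | -1 | 1 | -1 | 1 | 0 | -1 | -1 |
| 3 | 1 | 0 | 1 | 1 | 1 | -1 | -1 | -1 | 1 | 1 | ... | 1 | 1 | -1 | -1 | 1 | -1 | 1 | -1 | 1 | -1 |
| 4 | 1 | 0 | -1 | 1 | 1 | -1 | 1 | 1 | -1 | 1 | ... | -1 | 1 | -1 | -1 | 0 | -1 | 1 | 1 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 11050 | 1 | -1 | 1 | -1 | 1 | 1 | 1 | 1 | -1 | -1 | ... | -1 | -1 | 1 | 1 | -1 | -1 | 1 | 1 | 1 | 1 |
| 11051 | -1 | 1 | 1 | -1 | -1 | -1 | 1 | -1 | -1 | -1 | ... | -1 | 1 | 1 | 1 | 1 | 1 | 1 | -1 | 1 | -1 |
| 11052 | 1 | -1 | 1 | 1 | 1 | -1 | 1 | -1 | -1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | -1 | 1 | 0 | 1 | -1 |
| 11053 | -1 | -1 | 1 | 1 | 1 | -1 | -1 | -1 | 1 | -1 | ... | -1 | 1 | 1 | 1 | 1 | -1 | 1 | 1 | 1 | -1 |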


In [3]:
df.info()

In [4]:
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| having_IP_Address | 11055.0 | 0.313795 | 0.949534 | -1.0 | -1.0 | 1.0 | 1.0 | 1.0 |
| URL_Length | 11055.0 | -0.633198 | 0.766095 | -1.0 | -1.0 | -1.0 | -1.0 | 1.0 |
| Shortining_Service | 11055.0 | 0.738761 | 0.673998 | -1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| having_At_Symbol | 11055.0 | 0.700588 | 0.713598 | -1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| double_slash_redirecting | 11055.0 | 0.741474 | 0.671011 | -1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| Prefix_Suffix | 11055.0 | -0.734962 | 0.678139 | -1.0 | -1.0 | -1.0 | -1.0 | 1.0 |
| having_Sub_Domain | 11055.0 | 0.063953 | 0.817518 | -1.0 | -1.0 | 0.0 | 1.0 | 1.0 |
| SSLfinal_State | 11055.0 | 0.250927 | 0.911892 | -1.0 | -1.0 | 1.0 | 1.0 | 1.0 |
| Domain_registeration_length | 11055.0 | -0.336771 | 0.941629 | -1.0 | -1.0 | -1.0 | 1.0 | 1.0 |
| Favicon | 11055.0 | 0.628584 | 0.777777 | -1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


In [5]:
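# Input matrix: every column except the target.
X = df[df.columns[:-1]].values

In [6]:
# Target vector: map the labels {-1, +1} to {0, 1}.
y = ((df['Result'] + 1) / 2).values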

In [7]:
X.shape, y.shape, X.sum(), y.sum()

((11055, 30), (11055,), 100854, 6157.0)
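
The loading and preparation steps of Exercise 1 rely on the ARFF layout: the `@attribute` lines name the columns, and the rows after `@data` are plain CSV. A minimal sketch of the same idea on a small in-memory sample follows; the attribute names `feature_a` and `feature_b` and the two data rows are made up for illustration and are not taken from the real file.

```python
import io
import pandas as pd

# Tiny stand-in for the ARFF file (hypothetical content, same layout).
sample = (
    "@relation demo\r\n"
    "@attribute feature_a { -1,1 }\r\n"
    "@attribute feature_b { -1,0,1 }\r\n"
    "@attribute Result { -1,1 }\r\n"
    "@data\r\n"
    "1,0,-1\r\n"
    "-1,1,1\r\n"
)

lines = sample.split("\r\n")
names = [l.split()[1] for l in lines if l.startswith("@att")]  # column names
skiprows = lines.index("@data") + 1                            # header length
demo = pd.read_csv(io.StringIO(sample), names=names, skiprows=skiprows)

X_demo = demo[demo.columns[:-1]].values      # all columns except the target
y_demo = ((demo["Result"] + 1) / 2).values   # map {-1, +1} to {0, 1}
print(names, X_demo.shape, y_demo)
```

Mapping the target to {0, 1} matters because the cross-entropy objective assumes labels in that range.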

**Exercise 2**: Implement a multilayer perceptron classifier from scratch! Use stochastic gradient descent for training. Evaluate the model on the Phishing Websites data set using a 70%-30% train-test split!

In [15]:
import numpy as np

def sigmoid(t):
    """Logistic sigmoid, applied elementwise."""
    return 1 / (1 + np.exp(-t))

class SimpleMLPClassifier:
    def __init__(self, n_hidden=32, init_range=0.1, n_epochs=5,
                 learning_rate=0.01, random_state=42):
        self.n_hidden = n_hidden
        self.init_range = init_range
        self.n_epochs = n_epochs
        self.learning_rate = learning_rate
        self.random_state = random_state
        
    def _forward(self, x_i):
        h_i = sigmoid(self.W.T @ x_i)   # hidden activations
        yhat_i = sigmoid(self.v @ h_i)  # output activation
        return h_i, yhat_i
    
    def fit(self, X, y):
        # model initialization
        n, d = X.shape
        rs = np.random.RandomState(self.random_state)
        self.W = rs.uniform(-self.init_range, self.init_range, (d, self.n_hidden))   # hidden weights
        self.v = rs.uniform(-self.init_range, self.init_range, self.n_hidden)        # output weights
        
        for e in range(self.n_epochs):
            for i in range(n):
                h_i, yhat_i = self._forward(X[i])  # propagate the signal forward
                grad_v = (yhat_i - y[i]) * h_i     # derivative w.r.t. v
                
                eps_i = (yhat_i - y[i]) * self.v * h_i * (1 - h_i)  # backpropagated error
                grad_W = np.outer(X[i], eps_i)     # derivative w.r.t. W
                
                # update model parameters
                self.v -= self.learning_rate * grad_v
                self.W -= self.learning_rate * grad_W
        return self
    
    def predict(self, X):
        # predict class 1 when its estimated probability exceeds 0.5
        return (self.predict_proba(X)[:, 1] > 0.5).astype("int")

    def predict_proba(self, X):
        # returns an (n, 2) array: column 0 is P(y=0), column 1 is P(y=1)
        n = X.shape[0]
        Yhat = np.zeros((n, 2))
        for i in range(n):
            _, yhat_i = self._forward(X[i])
            Yhat[i, 1] = yhat_i
        Yhat[:, 0] = 1 - Yhat[:, 1]
        return Yhat

In [17]:
from sklearn.model_selection import ShuffleSplit
from sklearn.metrics import accuracy_score

tr, te = next(ShuffleSplit(test_size=0.3, random_state=42).split(X))

# n_epochs=0 scores the randomly initialized, untrained model as a baseline
for n_epochs in range(6):
    cl = SimpleMLPClassifier(n_epochs=n_epochs)
    cl.fit(X[tr], y[tr])
    print(n_epochs, accuracy_score(y[te], cl.predict(X[te])))

0 0.43050949653301174
1 0.9201085318058486
2 0.9182996683750377
3 0.9173952366596322
4 0.9179981911365692
5 0.9176967138981007
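
The per-sample loop used for computing probabilities in the classifier above can also be written as two matrix products over the whole input matrix. A sketch, assuming a weight matrix `W` and a vector `v` shaped like those of `SimpleMLPClassifier`; the values below are random stand-ins, not trained weights.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(0)
n, d, K = 6, 30, 32                           # 30 inputs, 32 hidden units
X_demo = rng.choice([-1, 0, 1], size=(n, d))  # stand-in inputs (assumed)
W = rng.uniform(-0.1, 0.1, (d, K))            # stand-in hidden weights
v = rng.uniform(-0.1, 0.1, K)                 # stand-in output weights

# per-sample forward pass, as in the class above
loop = np.array([sigmoid(v @ sigmoid(W.T @ x_i)) for x_i in X_demo])

# vectorized forward pass: row i of X_demo @ W equals W.T @ x_i
batch = sigmoid(sigmoid(X_demo @ W) @ v)

print(np.allclose(loop, batch))
```

The vectorized form only affects prediction speed; training by per-sample stochastic gradient descent still updates the weights after every example.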


**Exercise 3**: Compare the previous solution against scikit-learn's `MLPClassifier`!

In [20]:
from sklearn.neural_network import MLPClassifier

for n_epochs in range(1, 6):
    cl = MLPClassifier(
        hidden_layer_sizes=(32,), activation='logistic', solver='sgd', learning_rate_init=0.01, 
        max_iter=n_epochs, momentum=0, alpha=0, batch_size=1, random_state=42
    ) # the intercept terms cannot be switched OFF
    cl.fit(X[tr], y[tr])
    print(n_epochs, accuracy_score(y[te], cl.predict(X[te])))

1 0.9186011456135061
2 0.9198070545673802
3 0.9207114862827857
4 0.9222188724751281
5 0.9261380765752185
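
The comment in the cell above notes that scikit-learn's `MLPClassifier` always fits intercept terms, which the from-scratch model does not have. A small sketch on synthetic data (the random inputs and the toy target are assumptions made only to have something to fit) shows where those extra parameters live:

```python
import warnings
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X_demo = rng.choice([-1.0, 0.0, 1.0], size=(200, 30))  # synthetic inputs
y_demo = (X_demo[:, 0] > 0).astype(int)                # toy binary target

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the expected ConvergenceWarning
    cl = MLPClassifier(
        hidden_layer_sizes=(32,), activation='logistic', solver='sgd',
        learning_rate_init=0.01, max_iter=2, momentum=0, alpha=0,
        batch_size=1, random_state=42
    ).fit(X_demo, y_demo)

print([w.shape for w in cl.coefs_])       # weight matrices per layer
print([b.shape for b in cl.intercepts_])  # intercept vectors per layer
```

With 30 inputs and 32 hidden units this amounts to 30·32 + 32 + 32 + 1 = 1025 parameters, versus 30·32 + 32 = 992 in the from-scratch model, so the two classifiers are close but not identical.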




**Exercise 4**: Optimize the meta-parameters of the neural network!

In [21]:
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(max_iter=100)

In [22]:
parameter_space = {
    'hidden_layer_sizes': [(50,50,50), (50,100,50), (100,)],
    'activation': ['tanh', 'relu'],
    'solver': ['sgd', 'adam'],
    'alpha': [0.0001, 0.05],
    'learning_rate': ['constant','adaptive'],
}

In [25]:
from sklearn.model_selection import GridSearchCV

clf = GridSearchCV(mlp, parameter_space, cv=3, refit=True)
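
Before starting the search it can help to gauge its cost. The sketch below counts the candidate settings in the grid defined above with `itertools.product`; the variable names `n_combinations` and `n_folds` are introduced here for illustration.

```python
from itertools import product

# Same grid as in the cell above.
parameter_space = {
    'hidden_layer_sizes': [(50, 50, 50), (50, 100, 50), (100,)],
    'activation': ['tanh', 'relu'],
    'solver': ['sgd', 'adam'],
    'alpha': [0.0001, 0.05],
    'learning_rate': ['constant', 'adaptive'],
}

n_combinations = len(list(product(*parameter_space.values())))
n_folds = 3  # matches cv=3 above

# number of candidates, and number of model fits during cross-validation
print(n_combinations, n_combinations * n_folds)  # prints: 48 144
```

With `refit=True`, one more fit on the full training data follows once the best setting is known.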

In [None]:
clf.fit(X[tr], y[tr])

In [None]:
print('Best parameters found:\n', clf.best_params_)

In [None]:
from sklearn.metrics import classification_report
print('Results on the test set:')
print(classification_report(y[te], clf.predict(X[te])))

In [None]:
# Teacher solution
def evaluate(cl, X, y):
    """Fit cl on a 70% training split and return its accuracy on the remaining 30%."""
    tr, te = next(ShuffleSplit(test_size=0.3, random_state=42).split(X))
    cl.fit(X[tr], y[tr])
    return accuracy_score(y[te], cl.predict(X[te]))

In [None]:
from sklearn.model_selection import GridSearchCV

cl = MLPClassifier()
param_grid = {
    'learning_rate_init': [0.01, 0.02, 0.05],
    'max_iter': [50, 75],
    'batch_size': [48, 64, 72],
    'random_state': [42],
}

cv = ShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
gs = GridSearchCV(cl, param_grid, cv=cv, verbose=2)
gs.fit(X, y)
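
Passing a one-split `ShuffleSplit` as `cv` makes the grid search score every candidate on a single hold-out split instead of on several folds. With the same `test_size` and `random_state`, that split should coincide with the first split drawn by the default `ShuffleSplit` used earlier. The sketch below checks this on a dummy array; the sample count 11055 matches the data set, but no real data is loaded.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

n_samples = 11055
dummy = np.zeros((n_samples, 1))  # only the number of rows matters here

# first split of the default (multi-split) ShuffleSplit
tr_a, te_a = next(ShuffleSplit(test_size=0.3, random_state=42).split(dummy))
# the single split used as cv in the grid search
tr_b, te_b = next(ShuffleSplit(n_splits=1, test_size=0.3, random_state=42).split(dummy))

print(len(tr_a), len(te_a), np.array_equal(te_a, te_b))
```

If the two index sets agree, the scores reported by the grid search and by `evaluate` are measured on the same test rows and are directly comparable.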

In [None]:
df_res = pd.DataFrame(gs.cv_results_)
columns = ['param_batch_size', 'param_learning_rate_init', 'param_max_iter', 'split0_test_score']
df_res.sort_values(columns[-1], ascending=False)[columns]

In [None]:
df_res