# Background Information

Since the perceptron rule and Adaline are very similar, we will take the perceptron implementation that we defined earlier and change the ```fit``` method so that the weights are updated by minimising the cost function via gradient descent:

In [1]:
import numpy as np


class AdalineGD(object):
    """
    ADAptive LInear NEuron classifier.
    
    ----------
    Parameters
    ----------
    eta: float
        The learning rate (between 0.0 and 1.0)
    n_iter: int
        The number of passes over the training dataset.
    random_state: int
        The Random Number Generator seed for random weight initialisation.
    
    ----------
    Attributes
    ----------
    w_: 1d array
        The weights after fitting.
    cost_: list
        The sum-of-squares cost function value in each epoch (in each pass of n_iter)
    """
    
    def __init__(self, eta=0.01, n_iter=50, random_state=1):
        self.eta = eta
        self.n_iter = n_iter
        self.random_state = random_state
        
    def net_input(self, X):
        """
        Calculate the net input
        """
        return np.dot(X, self.w_[1:]) + self.w_[0]*1.0
    
    def activation(self, X):
        """
        Compute the linear activation function output
        """
        return X
    
    def predict(self, X):
        """
        Return the class label by applying the threshold function
        """
        return np.where(self.activation(self.net_input(X)) >= 0.0, 1, -1)
    
    def fit(self, X, y):
        """
        Fit the training data.
        
        ----------
        Parameters
        ----------
        X: array-like with shape of [n_samples rows, by n_features columns]
            X is the Training dataset, where n_samples is the number of flowers,
            and n_features is the number of dimensions, or columns.
        y: array-like with shape of [n_samples rows].
            y is the target values; the true class labels.
        
        -------
        Returns
        -------
        self: object
        """
        
        rgen = np.random.RandomState(self.random_state)
        self.w_ = rgen.normal(loc=0.0, scale=0.01, size=1 + X.shape[1])
        
        self.cost_ = []
        
        for _ in range(self.n_iter):
            net_input = self.net_input(X)
            output = self.activation(net_input)
            errors = (y - output)  # the true class label minus the calculated outcome
            self.w_[1:] += self.eta * X.T.dot(errors)
            self.w_[0] += self.eta * errors.sum() * 1.0
            cost = (errors**2).sum() / 2.0
            self.cost_.append(cost)  # record this epoch's cost to check convergence later
        
        return self

# Comments on the above code

Instead of updating the weights after evaluating each individual training sample (as in the perceptron), here we calculate the gradient based on the whole training dataset and scale it by the learning rate, via
```python
self.eta * errors.sum()
```
for the bias unit (zeroth-weight), and via
```python
self.eta * X.T.dot(errors)
```
for the weights $w_1$ to $w_m$, where ```X.T.dot(errors)``` is a matrix-vector multiplication between the transposed feature matrix and the error vector.
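To make the matrix-vector product concrete, here is a minimal sketch on made-up numbers (the values of `X` and `errors` below are purely illustrative, not taken from the Iris data). With the sum-of-squares cost $J(w) = \frac{1}{2}\sum_i \left(y^{(i)} - \phi(z^{(i)})\right)^2$, which matches ```(errors**2).sum() / 2.0``` in the class, the weight update for feature $j$ is $\eta \sum_i \left(y^{(i)} - \phi(z^{(i)})\right) x_j^{(i)}$. The sketch checks that ```X.T.dot(errors)``` computes exactly this sum over all samples in one step:

```python
import numpy as np

# Illustrative values only: 3 samples, 2 features.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
errors = np.array([0.5, -1.0, 2.0])  # stands in for (y - output)

# Vectorised form used in the fit method.
batch_gradient = X.T.dot(errors)

# Equivalent explicit loop: sum of error_i * x_i over all samples.
loop_gradient = np.zeros(X.shape[1])
for x_i, e_i in zip(X, errors):
    loop_gradient += e_i * x_i

# The bias unit sees a constant input of 1, so its sum is just the error sum.
bias_gradient = errors.sum()

print(batch_gradient)  # one entry per feature weight
print(loop_gradient)   # same values, computed sample by sample
print(bias_gradient)
```

Both forms give the same vector; the vectorised one simply avoids the Python loop.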

Note that the ```activation``` method has no effect in the code, since it is simply an identity function. We include it (and call it inside ```fit``` and ```predict```) to illustrate how information flows through a single-layer neural network: features from the input data, net input, activation, and output.
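The flow described above can be traced by hand on a tiny example. This is only a sketch: the weight vector `w` and the two input rows are hypothetical values chosen so the arithmetic is easy to follow, and the three steps mirror the ```net_input```, ```activation``` and ```predict``` methods of the class.

```python
import numpy as np

# Hypothetical weights: w[0] is the bias unit, w[1:] are the feature weights.
w = np.array([-1.0, 0.5, 0.25])

# Two illustrative samples with two features each.
X = np.array([[4.0, 2.0],
              [1.0, 1.0]])

# Step 1: net input, the weighted sum of the features plus the bias.
net_input = X.dot(w[1:]) + w[0]

# Step 2: activation, the identity function in Adaline.
activation = net_input

# Step 3: output, the threshold function that turns activations into class labels.
labels = np.where(activation >= 0.0, 1, -1)

print(net_input)
print(labels)
```

Because the activation is the identity, the continuous values from step 2 are what the errors in ```fit``` are computed from, while the thresholded labels from step 3 are only used for prediction.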

In the next chapter we will learn about a logistic regression classifier that uses a non-identity, nonlinear activation function. We will see that a logistic regression model is closely related to Adaline, the only differences being its activation and cost functions.

Now, similar to the previous perceptron implementation, we collect the cost values in a ```self.cost_``` list to check whether the algorithm converges after a number of epochs.

In practice, it often requires some experimentation to find a good learning rate $\eta$ for optimal convergence. In the upcoming code, we will choose two different learning rates, $\eta = 0.1$ and $\eta = 0.0001$, to start with and plot the cost functions versus the number of epochs to see how well the Adaline implementation learns from the training data.
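As a rough illustration of why the learning rate matters, the sketch below runs the same batch update rule with both learning rates and compares the cost per epoch. Several things here are assumptions made for the sake of a small, deterministic example: `adaline_costs` is a hypothetical stand-alone helper (not part of the class), the weights start at zero rather than at small random values, and the six rows of `X` are made-up numbers on the scale of raw, unstandardised flower measurements.

```python
import numpy as np


def adaline_costs(X, y, eta, n_iter):
    """Hypothetical helper: batch gradient descent, returns the cost per epoch."""
    w = np.zeros(1 + X.shape[1])  # zero start (assumption) keeps the run deterministic
    costs = []
    for _ in range(n_iter):
        errors = y - (X.dot(w[1:]) + w[0])   # identity activation
        w[1:] += eta * X.T.dot(errors)       # update for the feature weights
        w[0] += eta * errors.sum()           # update for the bias unit
        costs.append((errors ** 2).sum() / 2.0)
    return costs


# Made-up data: two features, three samples per class.
X = np.array([[5.1, 1.4], [4.9, 1.4], [4.7, 1.3],
              [7.0, 4.7], [6.4, 4.5], [6.9, 4.9]])
y = np.array([-1, -1, -1, 1, 1, 1])

costs_large = adaline_costs(X, y, eta=0.1, n_iter=10)
costs_small = adaline_costs(X, y, eta=0.0001, n_iter=10)

print("eta=0.1:    first cost %.3g, last cost %.3g" % (costs_large[0], costs_large[-1]))
print("eta=0.0001: first cost %.3g, last cost %.3g" % (costs_small[0], costs_small[-1]))
```

On this toy data the larger learning rate overshoots the minimum, so the cost grows from epoch to epoch, while the smaller one decreases the cost steadily but slowly. The same pattern is what the cost-versus-epoch plots are meant to reveal.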

## Note on Hyperparameters
The learning rate $\eta$, as well as the number of epochs (```n_iter```), are the so-called hyperparameters of the perceptron and Adaline learning algorithms.

# Plotting the cost against the number of epochs for two different learning rates

In [4]:
import pandas as pd

df = pd.read_csv("/home/henri/stuff/machine_learning/sebastian_raschka/henris_coding/chapter_02/iris.data",
                 header=None)
df.tail()

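|     | 0   | 1   | 2   | 3   | 4              |
|-----|-----|-----|-----|-----|----------------|
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | Iris-virginica |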


In [None]:
%matplotlib inline