Original paper: [here](https://isl.stanford.edu/~widrow/papers/t1960anadaptive.pdf)

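Adaline GD trains its weights with batch gradient descent.
* Loss function: mean squared error (MSE), scaled here by $\frac{1}{2}$ so that the factor of 2 cancels in the derivative
$$ L = \frac{1}{2n} \sum_{i=1}^{n} \left( y^{(i)} - \sigma(z^{(i)}) \right) ^ 2 $$
* Each weight change $\Delta w_j$ is a step against the gradient: the partial derivative of the loss with respect to $w_j$, scaled by the learning rate $\eta$
$$ \Delta w_j = - \eta \cdot \frac{\partial L}{\partial w_j} $$
$$ \frac{\partial L}{\partial w_j} = -\frac{1}{n} \sum_{i=1}^{n} \left( y^{(i)} - \sigma(z^{(i)}) \right) \cdot x_j^{(i)} $$

Here $z^{(i)} = w^T x^{(i)} + b$ is the net input and $\sigma$ is the identity activation, $\sigma(z) = z$, which is why no $\sigma'$ term appears in the derivative. The implementation below uses the plain MSE without the $\frac{1}{2}$, so its gradient carries an extra factor of 2.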
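
The gradient formula can be sanity-checked numerically. The following is a minimal sketch, assuming the $\frac{1}{2n}$ loss and identity activation from the equations above; the helper names (`half_mse`, `analytic_grad_w`) and the synthetic data are illustrative and not part of the notebook. It compares the analytic gradient with a central finite difference.

```python
import numpy as np

def half_mse(w, b, x, y):
    """L = 1/(2n) * sum((y - z)^2) with identity activation (hypothetical helper)."""
    z = x @ w + b
    return np.sum((y - z) ** 2) / (2 * x.shape[0])

def analytic_grad_w(w, b, x, y):
    """dL/dw_j = -1/n * sum((y - z) * x_j) (hypothetical helper)."""
    errors = y - (x @ w + b)
    return -x.T @ errors / x.shape[0]

# Synthetic data, purely illustrative (not the Iris data)
rng = np.random.RandomState(0)
x_demo = rng.normal(size=(20, 2))
y_demo = rng.randint(0, 2, size=20).astype(float)
w_demo = rng.normal(scale=0.1, size=2)
b_demo = 0.0

# Central finite difference, one weight at a time
eps = 1e-6
numeric = np.zeros_like(w_demo)
for j in range(w_demo.size):
    step = np.zeros_like(w_demo)
    step[j] = eps
    loss_plus = half_mse(w_demo + step, b_demo, x_demo, y_demo)
    loss_minus = half_mse(w_demo - step, b_demo, x_demo, y_demo)
    numeric[j] = (loss_plus - loss_minus) / (2 * eps)

analytic = analytic_grad_w(w_demo, b_demo, x_demo, y_demo)
grad_check_ok = np.allclose(numeric, analytic, atol=1e-6)
print(grad_check_ok)
```

If the two gradients disagree, the usual suspects are a sign error or a missing $\frac{1}{n}$.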

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [None]:
class AdalineGD:
  "ADAptive LInear NEuron classifier"
  def __init__(self, eta=0.01, n_iter=50, seed=1):
    self.eta = eta
    self.n_iter = n_iter
    self.seed = seed
  def fit(self, x, y):
    rng = np.random.RandomState(self.seed)
    self.w = rng.normal(loc=0, scale=0.01, size=x.shape[1])
    self.b = 0
    self.losses = []

    for i in range(self.n_iter):
      net_input = self.net_input(x)
      output = self.activation(net_input)
      errors = y - output
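      # step against the gradient of the plain MSE, mean(errors**2):
      # dL/dw = -2 * x.T @ errors / n and dL/db = -2 * mean(errors)
      self.w += self.eta * 2 * x.T.dot(errors) / x.shape[0]
      self.b += self.eta * 2 * np.mean(errors)
      # loss of the weights used in this epoch (computed before the update)
      loss = np.mean(errors**2)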
      self.losses.append(loss)
    return self
  def predict(self, x_i):
    return np.where(self.activation(self.net_input(x_i)) >= 0.5, 1, 0)
  def net_input(self, x_i):
    return x_i @ self.w + self.b
  def activation(self, x):
    return x

In [None]:
from matplotlib.colors import ListedColormap

def plot_decision_regions(ax, x, y, classifier, resolution=0.02):
    markers = ('o', 's', '^', 'v', '<')
    colors = ('b', 'r', 'lightgreen', 'gray', 'cyan')
    labels = ('Setosa', 'Versicolor')
    cmap = ListedColormap(colors[:len(np.unique(y))])

    # plot the decision surface
    x0_min, x0_max = x[:, 0].min()-1, x[:, 0].max()+1
    x1_min, x1_max = x[:, 1].min()-1, x[:, 1].max()+1
    xx0, xx1 = np.meshgrid(np.arange(x0_min, x0_max, resolution), np.arange(x1_min, x1_max, resolution))
    lab = classifier.predict(np.array((xx0.ravel(), xx1.ravel())).T)
    lab = lab.reshape(xx0.shape)

    ax.contourf(xx0, xx1, lab, alpha=0.3, cmap=cmap)
    ax.set_xlim(xx0.min(), xx0.max())
    ax.set_ylim(xx1.min(), xx1.max())

    # plot class examples
    for idx, cl in enumerate(np.unique(y)):
        ax.scatter(x[y == cl, 0], x[y == cl, 1], alpha=0.8, c=colors[idx], marker=markers[idx], label=labels[idx], edgecolor='k')
    ax.set_xlabel("Sepal length [standardized]")
    ax.set_ylabel("Petal length [standardized]")
    ax.legend()

In [None]:
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
df = pd.read_csv(url, header=None, encoding='utf-8')
df.tail()

In [None]:
## preprocess data
# the first 100 rows hold 50 Iris-setosa followed by 50 Iris-versicolor
y = df.iloc[:100, 4]
# Iris-setosa to 0, Iris-versicolor to 1
y = np.where(y == 'Iris-setosa', 0, 1)
# we are interested in sepal length (column 0) and petal length (column 2)
x = df.iloc[:100, [0, 2]].values

`ada1` shows a learning rate that is too large: instead of converging, each update overshoots the minimum and the loss grows from epoch to epoch.

`ada2` shows a learning rate that is too small: the loss does decrease, but so slowly that convergence would take a large number of epochs, which is undesirable.
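
Both failure modes can be reproduced on a toy problem. The following is a minimal sketch on the one-dimensional quadratic $f(w) = (w - 3)^2$, not on the Iris data; the function `run_gd` and the chosen learning rates are illustrative assumptions. Each step multiplies the distance to the minimum by $1 - 2\eta$, so the loss grows when $|1 - 2\eta| > 1$ and shrinks slowly when $\eta$ is close to 0.

```python
def run_gd(eta, n_iter=15, w0=0.0, target=3.0):
    """Gradient descent on f(w) = (w - target)**2 (hypothetical helper).

    Returns the loss recorded at the start of each step.
    """
    w = w0
    losses = []
    for _ in range(n_iter):
        losses.append((w - target) ** 2)
        w -= eta * 2 * (w - target)  # f'(w) = 2 * (w - target)
    return losses

losses_large = run_gd(eta=1.1)    # |1 - 2*eta| = 1.2: overshoots and diverges
losses_small = run_gd(eta=0.001)  # factor 0.998: converges, but very slowly
losses_ok = run_gd(eta=0.4)       # factor 0.2: converges quickly

print("too large:", losses_large[0], "->", losses_large[-1])
print("too small:", losses_small[0], "->", losses_small[-1])
print("reasonable:", losses_ok[0], "->", losses_ok[-1])
```

The same reasoning carries over to Adaline, except that the admissible range of $\eta$ then depends on the scale of the features.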

In [None]:
ada1 = AdalineGD(n_iter=15, eta=0.1).fit(x, y)

fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))

ax[0].plot(range(1, len(ada1.losses) + 1), np.log10(ada1.losses), marker='o')
ax[0].set_xlabel('Epochs')
ax[0].set_ylabel('log10(Mean squared error)')
ax[0].set_title('Adaline - Learning rate 0.1')

ada2 = AdalineGD(n_iter=15, eta=0.0001).fit(x, y)
ax[1].plot(range(1, len(ada2.losses) + 1), ada2.losses, marker='o')
ax[1].set_xlabel('Epochs')
ax[1].set_ylabel('Mean squared error')
ax[1].set_title('Adaline - Learning rate 0.0001')

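Gradient descent benefits from *standardization* of the dataset.

Standardization shifts every feature to zero mean and scales it to unit standard deviation. This helps when features live on very different scales: a single learning rate $\eta$ is then too large for the weight of one feature and too small for the weight of another, so one weight would converge almost instantly while the other would take a lot of epochs.

Once all features are centered at zero with comparable spread, this difference balances out and one learning rate works reasonably well for every weight.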

Standardization: $$x_j^{'} = \frac{x_j - \mu_j}{\sigma_j}$$

$\sigma_j$ - standard deviation of feature $x_j$

$\mu_j$ - mean of feature $x_j$
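
The formula can be applied to all columns at once with NumPy's `axis=0` reductions. The following is a minimal sketch, assuming a hypothetical helper `standardize` and synthetic data with two deliberately mismatched feature scales; neither is part of the notebook.

```python
import numpy as np

def standardize(x):
    """Return a copy of x with zero mean and unit std per column (hypothetical helper).

    Uses the population standard deviation (np.std default, ddof=0).
    Assumes no column is constant, otherwise sigma would be 0.
    """
    mu = x.mean(axis=0)
    sigma = x.std(axis=0)
    return (x - mu) / sigma

# Synthetic features on very different scales, purely illustrative
rng = np.random.RandomState(1)
x_demo = np.column_stack([
    rng.normal(loc=50.0, scale=10.0, size=100),
    rng.normal(loc=0.5, scale=0.1, size=100),
])

x_demo_std = standardize(x_demo)
print(x_demo_std.mean(axis=0))  # each entry should be close to 0
print(x_demo_std.std(axis=0))   # each entry should be close to 1
```

When a model is later applied to new data, the same $\mu_j$ and $\sigma_j$ computed on the training set should be reused rather than recomputed.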


In [None]:
x_std = np.copy(x)
x_std[:,0] = (x[:,0] - np.mean(x[:,0]))/np.std(x[:,0])
x_std[:,1] = (x[:,1] - np.mean(x[:,1]))/np.std(x[:,1])
ada_gd = AdalineGD(n_iter=20, eta=0.5).fit(x_std, y)

fig, ax = plt.subplots()
plot_decision_regions(ax, x_std, y, classifier=ada_gd)
ax.set_title('Adaline - Gradient descent')
fig.set_tight_layout(True)
plt.show()

fig, ax = plt.subplots()
ax.plot(range(1, len(ada_gd.losses) + 1), ada_gd.losses, marker='o')
ax.set_xlabel('Epochs')
ax.set_ylabel('Mean squared error')
fig.set_tight_layout(True)
plt.show()
