# Training Simple Machine Learning Algorithms for Classification

In this chapter, we will make use of two of the first algorithmically described machine learning algorithms for classification, the perceptron and adaptive linear neurons. We will start by implementing a perceptron step by step in Python and training it to classify different flower species in the Iris dataset.

## Artificial neurons
Trying to understand how the biological brain works in order to design artificial intelligence (AI), Warren McCulloch and Walter Pitts published the first concept of a simplified brain cell, the so-called McCulloch-Pitts (MCP) neuron, in 1943. Neurons are interconnected nerve cells in the brain that are involved in the processing and transmitting of chemical and electrical signals, as illustrated in the following figure:

<img src="images/neuron_cell.jpeg" alt="Neuron cells" title="Neuron Cells" width="550" height="350">

Only a few years later, Frank Rosenblatt published the first concept of the perceptron learning rule based on the MCP neuron model. With his perceptron rule, Rosenblatt proposed an algorithm that would automatically learn the optimal weight coefficients that are then multiplied with the input features in order to make the decision of whether a neuron fires or not.

### The formal definition of an artificial neuron
More formally, we can put the idea behind artificial neurons into the context of a binary classification task in which the two classes are labeled $1$ and $-1$. We can then define a decision function $\phi(z)$ that takes a linear combination of certain input values $x$ and a corresponding weight vector $w$, where $z$ is the so-called net input $z = w_1x_1 + \dots + w_mx_m$:

\begin{equation*}
w = \begin{bmatrix}
w_1 \\
\vdots \\
w_m
\end{bmatrix}, x = \begin{bmatrix}
x_1 \\
\vdots \\
x_m
\end{bmatrix}
\end{equation*}

In the perceptron algorithm, the decision function $\phi(\cdot)$ is a variant of a **unit step function**:

\begin{equation*}
    \phi(z) = \left\{ \begin{matrix}
    1 & \text{if } z \ge \Theta \\
    -1 & \text{otherwise}
    \end{matrix} \right.
\end{equation*}

For simplicity, we can bring the threshold $\Theta$ to the left side of the equation and define a weight-zero as $w_0 = -\Theta$ and $x_0 = 1$, so that we can write $z$ in a more compact form:

\begin{equation*}
z = w_0x_0 + w_1x_1 + \dots + w_mx_m = w^T x
\end{equation*}

and

\begin{equation*}
    \phi(z) = \left\{ \begin{matrix}
    1 & \text{if } z \ge 0 \\
    -1 & \text{otherwise}
    \end{matrix} \right.
\end{equation*}

In the machine learning literature, the negative threshold, or weight, $w_0 = -\Theta$, is usually called the *bias unit*.
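
The compact form above can be sketched in a few lines of NumPy. This is only an illustration: the function names `net_input` and `unit_step` and the numeric values are assumptions made for this example, and the bias unit is assumed to be stored as the first entry of the weight vector.

```python
import numpy as np


def net_input(x, w):
    """Return z = w^T x, where w[0] is the bias unit and x omits x_0 = 1."""
    return np.dot(x, w[1:]) + w[0]


def unit_step(z):
    """Return 1 where z >= 0 and -1 otherwise."""
    return np.where(z >= 0.0, 1, -1)


# Hypothetical weights and samples, chosen only for illustration.
w = np.array([-0.5, 1.0, 2.0])   # w_0 = -0.5 corresponds to a threshold of 0.5
x = np.array([[0.2, 0.1],        # z = -0.5 + 0.2 + 0.2 = -0.1
              [1.0, 0.5]])       # z = -0.5 + 1.0 + 1.0 =  1.5
z = net_input(x, w)
labels = unit_step(z)            # first sample -> -1, second sample -> 1
```

Storing the bias unit inside the weight vector means the threshold is learned like any other weight, without having to prepend a column of ones to the inputs.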

The following figure illustrates how the net input $z = w^T x$ is squashed into a binary output $(-1 \text{ or } 1)$ by the decision function of the perceptron (left subfigure) and how it can be used to discriminate between two linearly separable classes (right subfigure):

 <img src="images/perceptron_binary_output.jpeg" alt="Perceptron binary Output" title="Perceptron binary Output" width="550" height="350">

### Perceptron learning rule

Rosenblatt's initial perceptron rule is fairly simple and can be summarized by the following steps:

> 1. Initialize the weights to 0 or small random numbers.
> 2. For each training sample $x^{(i)}$:
>   1. Compute the output value $\hat{y}$.
>   2. Update the weights.

Here, the output value is the class label predicted by the unit step function that we defined earlier, and the simultaneous update of each weight $w_j$ in the weight vector $w$ can be more formally written as:

\begin{equation*}
w_j := w_j + \Delta w_j
\end{equation*}

The value of $\Delta w_j$, which is used to update the weight $w_j$, is calculated by the perceptron learning rule:

\begin{equation*}
\Delta w_j = \eta \left( y^{(i)} - \hat{y}^{(i)} \right)x^{(i)}_j
\end{equation*}

Here, $\eta$ is the learning rate, $y^{(i)}$ is the **true class label** of the $i$th training sample, and $\hat{y}^{(i)}$ is the **predicted class label**. For example, for a two-dimensional dataset, we would write the update as:

\begin{equation*}
\Delta w_0 = \eta \left( y^{(i)} - \hat{y}^{(i)} \right)
\end{equation*}

\begin{equation*}
\Delta w_1 = \eta \left( y^{(i)} - \hat{y}^{(i)} \right) x_1^{(i)}
\end{equation*}

\begin{equation*}
\Delta w_2 = \eta \left( y^{(i)} - \hat{y}^{(i)} \right) x_2^{(i)}
\end{equation*}
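
To make the rule concrete, a single update step might be sketched as follows. The helper `perceptron_update` and the sample values are hypothetical and exist only for this illustration; the bias unit is assumed to be the first entry of `w`.

```python
import numpy as np


def perceptron_update(w, x_i, y_i, eta=0.1):
    """Apply one perceptron update for a single sample and return new weights.

    w[0] is the bias unit; x_i holds the features x_1..x_m (x_0 = 1 is implicit).
    """
    z = np.dot(x_i, w[1:]) + w[0]
    y_hat = 1 if z >= 0.0 else -1      # unit step function
    delta = eta * (y_i - y_hat)        # zero when the prediction is correct
    w_new = w.copy()
    w_new[1:] += delta * x_i
    w_new[0] += delta                  # x_0 = 1, so the bias moves by delta
    return w_new


# Hypothetical two-dimensional sample; with zero weights z = 0, so y_hat = 1.
w = np.zeros(3)
x_i = np.array([2.0, 3.0])

w_correct = perceptron_update(w, x_i, y_i=1)    # correct prediction: no change
w_wrong = perceptron_update(w, x_i, y_i=-1)     # wrong: delta = 0.1 * (-2)
```

When the prediction is correct, the factor $y^{(i)} - \hat{y}^{(i)}$ is zero and the weights stay unchanged. When it is wrong, each weight moves by $2\eta$ times the corresponding feature value in the direction of the true class.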

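It is important to note that the convergence of the perceptron is only guaranteed if the two classes are linearly separable and the learning rate is sufficiently small. If the two classes cannot be separated by a linear decision boundary, we can set a maximum number of passes over the training dataset (epochs); otherwise, the perceptron would never stop updating the weights.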

<img src="images/separation_spaces.jpeg" alt="linear and non-linear separable spaces" title="linear and non-linear separable spaces" width="550" height="350">

Let us summarize what we just learned in a simple diagram that illustrates the general concept of the perceptron:

<img src="images/perceptron_diagram.jpeg" alt="Perceptron" title="Perceptron" width="550" height="350">

## Implementation of a Perceptron in Python

In [1]:
# implement the perceptron class
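
One possible shape for such a class is sketched below. It is not a reference solution: the names `Perceptron`, `eta`, `n_iter`, `random_state`, `w_`, and `errors_` are assumptions made for this example, as is the choice to initialize the weights with small random numbers and to keep the bias unit in `w_[0]`.

```python
import numpy as np


class Perceptron:
    """Perceptron classifier (illustrative sketch).

    eta: learning rate; n_iter: number of epochs (passes over the data);
    random_state: seed for the random weight initialization.
    """

    def __init__(self, eta=0.01, n_iter=50, random_state=1):
        self.eta = eta
        self.n_iter = n_iter
        self.random_state = random_state

    def fit(self, X, y):
        """Learn the weights from X (n_samples, n_features) and labels y in {-1, 1}."""
        rgen = np.random.RandomState(self.random_state)
        # One weight per feature plus the bias unit in w_[0].
        self.w_ = rgen.normal(loc=0.0, scale=0.01, size=1 + X.shape[1])
        self.errors_ = []  # number of misclassifications in each epoch
        for _ in range(self.n_iter):
            errors = 0
            for xi, target in zip(X, y):
                update = self.eta * (target - self.predict(xi))
                self.w_[1:] += update * xi
                self.w_[0] += update
                errors += int(update != 0.0)
            self.errors_.append(errors)
        return self

    def net_input(self, X):
        """Compute z = w^T x for one sample or a batch of samples."""
        return np.dot(X, self.w_[1:]) + self.w_[0]

    def predict(self, X):
        """Return the class label given by the unit step function."""
        return np.where(self.net_input(X) >= 0.0, 1, -1)


# Tiny hypothetical, linearly separable dataset used only to exercise the class.
X_demo = np.array([[-2.0, -1.0], [-1.0, -2.0], [1.0, 2.0], [2.0, 1.0]])
y_demo = np.array([-1, -1, 1, 1])
model = Perceptron(eta=0.1, n_iter=10).fit(X_demo, y_demo)
```

Recording the misclassifications per epoch in `errors_` makes it easy to check later whether training has converged.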

### Training a perceptron model

To test our perceptron implementation, we will load the two flower classes Setosa and Versicolor from the Iris dataset. Although the perceptron rule is not restricted to two dimensions, we will only consider the two features sepal length and petal length for visualization purposes. Also, we only chose the two flower classes Setosa and Versicolor for practical reasons.

In [None]:
import pandas as pd

# load the Iris data

In [None]:
# extract the first 100 samples (50 Iris-setosa, 50 Iris-versicolor) and convert the class labels
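
The extraction step could look roughly like the sketch below. To stay self-contained, it reads a four-row excerpt from an in-memory string instead of the real file. The column layout (four measurements followed by the class label), the mapping of Setosa to `-1` and Versicolor to `1`, and the names `sample_csv`, `X`, and `y` are assumptions made for this example.

```python
import io

import numpy as np
import pandas as pd

# Small excerpt in the assumed layout of the Iris data file:
# sepal length, sepal width, petal length, petal width, class label.
sample_csv = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.4,3.2,4.5,1.5,Iris-versicolor\n"
)
df = pd.read_csv(sample_csv, header=None)

# With the full dataset, df.iloc[0:100, ...] would select the 50 Setosa and
# 50 Versicolor rows; the excerpt here is small enough to use every row.
labels = df.iloc[:, 4].to_numpy()
y = np.where(labels == "Iris-setosa", -1, 1)

# Columns 0 and 2 hold sepal length and petal length.
X = df.iloc[:, [0, 2]].to_numpy()
```

Encoding the labels as `-1` and `1` matches the output of the unit step function, so the difference between true and predicted label can be used directly in the update rule.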

In [None]:
%matplotlib inline
# plot the extracted data

In [None]:
# train the perceptron with eta=0.1 and 10 epochs

In [None]:
# plot the number of misclassifications per epoch

In [None]:
# a function to plot the results
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import ListedColormap


def plot_decision_regions(x, y, classifier, resolution=0.02):
    # setup marker generator and color map
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])

    # plot the decision surface
    x1_min, x1_max = x[:, 0].min() - 1, x[:, 0].max() + 1
    x2_min, x2_max = x[:, 1].min() - 1, x[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution), np.arange(x2_min, x2_max, resolution))
    z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    z = z.reshape(xx1.shape)

    plt.contourf(xx1, xx2, z, alpha=0.3, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())

    # plot class samples
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=x[y == cl, 0],
                    y=x[y == cl, 1],
                    alpha=0.8,
                    c=colors[idx],
                    marker=markers[idx],
                    label=cl,
                    edgecolors='black')

In [None]:
# use the function above to plot the decision regions