# Perceptron

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ChemAI-Lab/AI4Chem/blob/main/website/modules/03-perceptron.ipynb)

**References:**
1. **Chapter 4**: [Pattern Recognition and Machine Learning](https://www.microsoft.com/en-us/research/wp-content/uploads/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf), C. M. Bishop.

In [None]:
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from matplotlib import animation
from matplotlib.colors import ListedColormap

from IPython.display import HTML

import ipywidgets as widgets
from IPython.display import display


from sklearn.datasets import make_moons, make_classification

# Introduction

To motivate neural networks, we first pay homage to the earliest "architecture", the **perceptron**. 
Introduced by Rosenblatt in 1962 as a linear discriminant model, it occupies an important place in the history of pattern recognition algorithms.
A discriminant model predicts the class of a data point, so the output of the model is a binary variable, 
$$
y = \sigma(\mathbf{w}^\top\phi(\mathbf{x})),
$$
where $y$ can take only two values, $y \in \{-1, +1\}$; e.g., $y=-1$ means class 1 and $y=+1$ means class 2.<br>
$\mathbf{w}$ are the parameters of the model.<br>
$\phi(\mathbf{x})$ is the **feature space** representation of $\mathbf{x}$, obtained with a fixed nonlinear transformation $\phi(\cdot)$. <br>
$\sigma(\cdot)$ has a special name, the **activation function**. We will see later that the choice of $\sigma$ dictates many of the properties of neural networks. 

For the perceptron model, $\sigma$ is a step function,
$$
\sigma(x) =  \left\{\begin{matrix}
 +1,\quad x \geq 0\\
 -1,\quad x < 0
\end{matrix}\right.
$$

In this setup, the only quantity we can "tune" is $\mathbf{w}$; its value determines whether $y$ assigns a point to class 1 or class 2,
$$
y = \left\{\begin{matrix}
 +1,\quad \mathbf{w}^\top\phi(\mathbf{x}) \geq 0 \\
 -1, \quad \mathbf{w}^\top\phi(\mathbf{x}) < 0
\end{matrix}\right.
$$
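The decision rule above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the identity feature map $\phi(\mathbf{x}) = \mathbf{x}$ and a hand-picked weight vector; the names `step`, `predict`, `w_demo`, and `Phi_demo` are hypothetical helpers that are not used elsewhere in this notebook.

```python
import numpy as np

def step(z):
    """Step activation: +1 where z >= 0, -1 otherwise."""
    return np.where(z >= 0, 1, -1)

def predict(w, Phi):
    """Apply y = step(w^T phi(x)) to each row of Phi."""
    return step(Phi @ w)

# Hypothetical weights and three points, with phi(x) = x (identity features).
w_demo = np.array([1.0, -1.0])
Phi_demo = np.array([[2.0, 1.0],    # w^T x =  1 -> +1
                     [0.0, 3.0],    # w^T x = -3 -> -1
                     [1.0, 1.0]])   # w^T x =  0 -> +1 (the boundary counts as +1)
print(predict(w_demo, Phi_demo))    # [ 1 -1  1]
```

Note that `np.sign` returns 0 for an input of exactly zero, whereas the step function defined above assigns $+1$ in that case.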

In [None]:

def get_data(name="linear"):
    if name == "linear":
        X, y = make_classification(
            n_features=2, n_redundant=0, n_informative=2,
            n_clusters_per_class=1, class_sep=1.2, flip_y=0,
            random_state=1
        )

        rng = np.random.RandomState(2)
        X += 2 * rng.uniform(size=X.shape)
        linearly_separable = (X, y)
        dataset = linearly_separable
    elif name == "moons":
        X, y = make_moons(noise=0.1, n_samples=200, )
        dataset = (X, y)
    return dataset

dataset_name = "linear"
dataset = get_data(dataset_name)

cm_bright = ListedColormap(["#FF0000", "#0000FF"])
plt.figure(figsize=(8, 4))
plt.scatter(
    dataset[0][:, 0], dataset[0][:, 1],
    c=dataset[1], cmap=cm_bright,
    edgecolor='k', s=100
)
if dataset_name == "linear":
    plt.title("Linearly separable data", fontsize=16)
elif dataset_name == "moons":
    plt.title("Non-linearly separable data", fontsize=16)

# Total Number of Misclassified Points

A natural error function for adjusting $\mathbf{w}$ is the total number of misclassified points.
However, this "simple" error function is hard to optimize: it is a piecewise constant function of $\mathbf{w}$, with discontinuities wherever a change in $\mathbf{w}$ moves the decision boundary across one of the data points. We will see shortly that the number of misclassified points changes with $\mathbf{w}$ only in jumps. Its gradient is therefore zero almost everywhere, and we cannot use the standard gradient descent methods commonly applied to losses such as the binary cross-entropy (BCE). 
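The piecewise constant behaviour can be seen on a tiny example. The sketch below assumes hypothetical one-dimensional toy data with features $\phi(x) = (x, 1)$, keeps the slope fixed at 1, and varies only the bias $b$; the helper `n_misclassified` is illustrative.

```python
import numpy as np

# Hypothetical 1D toy data with a bias feature: phi(x) = (x, 1).
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([-1, -1, 1, 1])
Phi = np.column_stack([x, np.ones_like(x)])

def n_misclassified(w, Phi, y):
    """Count points with y_i * w^T phi(x_i) <= 0."""
    return int(np.sum(y * (Phi @ w) <= 0))

# Slide the decision threshold by varying the bias with the slope fixed at 1.
biases = [-3.0, -1.5, -1.4, 0.0, 0.1, 1.5, 3.0]
counts = [n_misclassified(np.array([1.0, b]), Phi, y) for b in biases]
for b, c in zip(biases, counts):
    print(f"b={b:+.1f}  misclassified={c}")
```

Small changes in $b$ (for example from $-1.5$ to $-1.4$) leave the count unchanged, so there is no gradient signal; the count only jumps when the threshold crosses a data point.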

## Perceptron Criterion
The perceptron criterion associates zero error with any pattern that is correctly classified, whereas for a misclassified pattern $\mathbf{x}_i$ it tries to minimize the quantity $-\mathbf{w}^\top\phi(\mathbf{x}_i)\,y_i$. The perceptron criterion is therefore given by, 
$$
{\cal L}(\mathbf{w}) = - \sum_{i \in {\cal M}} \mathbf{w}^\top\phi(\mathbf{x}_i)\,y_i, 
$$
where ${\cal M}$ denotes the set of misclassified points. For a correctly classified point, $\mathbf{w}^\top\phi(\mathbf{x}_i)\,y_i$ is positive; for a misclassified point it is negative, so every term in ${\cal L}(\mathbf{w})$ is non-negative and minimizing ${\cal L}$ pushes these points toward the correct side of the decision boundary.

The optimization of $\mathbf{w}$ can be carried out with gradient-based methods. Since $\mathbf{w}^\top\phi(\mathbf{x}_i)\,y_i$ is linear in $\mathbf{w}$, 
$$
\nabla_{\mathbf{w}}{\cal L}(\mathbf{w}) = - \sum_{i \in {\cal M}} \phi(\mathbf{x}_i)\,y_i.
$$
We can use $\nabla_{\mathbf{w}}{\cal L}$ to "**update**" $\mathbf{w}$ by moving in the negative gradient direction, 
$$
\mathbf{w}_{t+1} = \mathbf{w}_{t} - \eta \nabla_{\mathbf{w}}{\cal L},
$$
where $\eta$ is a scalar parameter commonly known as the **learning rate**.
Note that, as the weight vector evolves during training, the set of patterns that are misclassified will change.
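The criterion and its gradient can be written in vectorized form. The sketch below is a minimal illustration on hypothetical toy data with features $\phi(\mathbf{x}) = (x_1, x_2, 1)$; the function name `perceptron_criterion` and all numerical values are assumptions made for the example. It takes one batch gradient step over all misclassified points at once.

```python
import numpy as np

def perceptron_criterion(w, Phi, y):
    """Return the perceptron criterion and its gradient with respect to w.

    Only misclassified points (y_i * w^T phi_i <= 0) contribute.
    """
    margins = y * (Phi @ w)
    mis = margins <= 0
    loss = -np.sum(margins[mis])
    grad = -np.sum(y[mis, None] * Phi[mis], axis=0)
    return loss, grad

# Hypothetical toy data with phi(x) = (x1, x2, 1).
Phi = np.array([[ 1.0,  2.0, 1.0],
                [ 2.0,  0.5, 1.0],
                [-1.0, -1.5, 1.0],
                [-2.0,  1.0, 1.0]])
y = np.array([1, 1, -1, -1])
w = np.array([-0.5, 0.2, 0.0])   # deliberately poor starting weights
eta = 0.1

loss, grad = perceptron_criterion(w, Phi, y)
w_new = w - eta * grad           # one batch gradient-descent step
loss_new, _ = perceptron_criterion(w_new, Phi, y)
print(f"loss before: {loss:.2f}, loss after: {loss_new:.2f}")
```

With these starting weights all four points are misclassified, and a single step reduces the criterion. The perceptron algorithm applies the same update one misclassified sample at a time rather than summing over the whole set.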


## Perceptron Algorithm

$$
\begin{array}{l}
\textbf{Perceptron Algorithm} \\ \hline
\textbf{Input: } \{(\mathbf{x}_i, y_i)\}_{i=1}^N,\; y_i \in \{-1,+1\},\; \eta,\; T \\
\textbf{Initialize: } \mathbf{w} \\[0.4em]
\textbf{for } t = 1 \textbf{ to } T \quad\text{ // iterations} \\ 
\quad \textbf{for } i = 1 \textbf{ to } N \quad\text{ // loop over dataset}\\
\quad\quad \textbf{if } y_i(\mathbf{w}^\top \phi(\mathbf{x}_i)) \le 0  \quad\text{ // misclassified points}\\
\quad\quad\quad \mathbf{w} \leftarrow \mathbf{w} + \eta y_i \phi(\mathbf{x}_i)  \quad\text{ // gradient update} \\
\quad\quad \textbf{end if} \\
\quad \textbf{end for} \\
\textbf{end for} \\
\textbf{Output: } \mathbf{w}
\end{array}
$$




In [None]:
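def perceptron_train(X_in, y_in, epochs=20, eta=0.01, seed=0):
    """Train a perceptron with per-sample updates; labels y_in must be in {-1, +1}."""
    rng = np.random.RandomState(seed)
    w = rng.randn(X_in.shape[1], 1)  # random initial weights, shape (n_features, 1)
    history = []  # weights at the end of each epoch
    errors = []   # number of updates (misclassified samples) per epoch
    for epoch in range(epochs):
        m_tot = 0
        for i in range(X_in.shape[0]):
            xi = X_in[i].reshape(-1, 1)
            yi = y_in[i]
            linear_output = np.dot(w.T, xi)[0, 0]
            if yi * linear_output <= 0:  # misclassified (or exactly on the boundary)
                w += eta * yi * xi       # perceptron update
                m_tot += 1
        history.append(w.copy())
        errors.append(m_tot)
        print(f"Epoch {epoch}: Number of misclassified samples={m_tot}, w={w.flatten()}")
    return w, history, errors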


In [None]:
X = dataset[0]
y = dataset[1]
y_labels = np.where(y == 0, -1, 1)

use_bias = True
eta = 2E-3
epochs = 100

if use_bias:
    X_train = np.hstack((X, np.ones((X.shape[0], 1))))
else:
    X_train = X

w, w_history, error_ = perceptron_train(
    X_train, y_labels, epochs=epochs, eta=eta
)
y_pred = np.sign(np.dot(X_train, w)).flatten()

def decision_values(grid, w, use_bias):
    if use_bias:
        grid_aug = np.hstack((grid, np.ones((grid.shape[0], 1))))
        return np.dot(grid_aug, w)
    return np.dot(grid, w)


In [None]:
x1 = np.linspace(X[:, 0].min() * 1.1, X[:, 0].max() * 1.1, 200)
x2 = np.linspace(X[:, 1].min() * 0.9, X[:, 1].max() * 1.1, 200)
x1_x2 = np.meshgrid(x1, x2)
X12_grid = np.c_[x1_x2[0].ravel(), x1_x2[1].ravel()]
linear_output = decision_values(X12_grid, w, use_bias)

fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(16, 4), gridspec_kw={"width_ratios": [2, 1]})

ax_left.scatter(X[:, 0], X[:, 1], c=y_labels, s=30, cmap=cm_bright, edgecolors="k")
ax_left.contourf(
    x1_x2[0],
    x1_x2[1],
    linear_output.reshape(x1_x2[0].shape),
    levels=[-1e5, 0, 1e5],
    alpha=0.2,
    colors=["red", "blue"],
)
ax_left.set_xlabel(rf"$x_1$", fontsize=16)
ax_left.set_ylabel(rf"$x_2$", fontsize=16)

mis_idx = np.where(y_labels != y_pred)[0]
if mis_idx.size:
    ax_left.scatter(
        X[mis_idx, 0], X[mis_idx, 1],
        c="none", edgecolors="k", s=100, linewidths=2, marker="o"
    )

ax_left.set_title("Perceptron decision regions (final)", fontsize=14)

ax_right.plot(range(len(error_)), error_, marker="o", color="k")
ax_right.set_xlabel("Epoch", fontsize=14)
ax_right.set_ylabel("Misclassified samples", fontsize=14)
ax_right.grid(True, alpha=0.2)

In [None]:
fig, ax_left = plt.subplots(figsize=(8, 4))
ax_left.scatter(X[:, 0], X[:, 1], c=y_labels, s=30, cmap=cm_bright, edgecolors="k")

x_min, x_max = X[:, 0].min() * 1.1, X[:, 0].max() * 1.1
y_min, y_max = X[:, 1].min() * 0.9, X[:, 1].max() * 1.1
ax_left.set_xlim(x_min, x_max)
ax_left.set_ylim(y_min, y_max)

x1 = np.linspace(x_min, x_max, 200)
x2 = np.linspace(y_min, y_max, 200)
x1_x2 = np.meshgrid(x1, x2)
X12_grid = np.c_[x1_x2[0].ravel(), x1_x2[1].ravel()]

line, = ax_left.plot([], [], "k-", linewidth=2)
contour = None
mis_scatter = None

def line_from_w(w):
    if use_bias:
        w0, w1, b = w.flatten()
    else:
        w0, w1 = w.flatten()
        b = 0.0
    if abs(w1) < 1e-12:
        x_vert = -b / w0 if abs(w0) > 1e-12 else 0.0
        return np.array([x_vert, x_vert]), np.array([y_min, y_max])
    x_line = np.array([x_min, x_max])
    y_line = -(w0 * x_line + b) / w1
    return x_line, y_line

def init():
    line.set_data([], [])
    return (line,)

def update(frame):
    global contour, mis_scatter
    w_frame = w_history[frame]
    x_line, y_line = line_from_w(w_frame)
    line.set_data(x_line, y_line)

    linear_output = decision_values(X12_grid, w_frame, use_bias)
    if contour is not None:
        contour.remove()
    contour = ax_left.contourf(
        x1_x2[0],
        x1_x2[1],
        linear_output.reshape(x1_x2[0].shape),
        levels=[-1e5, 0, 1e5],
        alpha=0.2,
        colors=["red", "blue"],
    )

    y_pred_frame = np.sign(np.dot(X_train, w_frame)).flatten()
    mis_idx = np.where(y_labels != y_pred_frame)[0]
    if mis_scatter is not None:
        mis_scatter.remove()
        mis_scatter = None
    if mis_idx.size:
        mis_scatter = ax_left.scatter(
            X[mis_idx, 0], X[mis_idx, 1],
            c="none", edgecolors="k", s=100, linewidths=2, marker="o"
        )

    ax_left.set_title(f"Perceptron boundary - epoch {frame}", fontsize=14)
    return (line,)

ani_line = animation.FuncAnimation(
    fig, update, frames=len(w_history), init_func=init, interval=300, blit=False
)
HTML(ani_line.to_jshtml())


# Sigmoid and BCE

In binary classification, we commonly use the binary cross-entropy (BCE) error,
$$
{\cal L}_{\text{BCE}} = -\frac{1}{N}\sum_{i=1}^N \left (y_i\log(p_i) + (1 - y_i)\log(1- p_i) \right ), 
$$
where $y_i \in \{0, 1\}$ is the true binary label and $p_i$ is the predicted probability that $y_i = 1$. 
The predicted probability is commonly obtained by applying the **sigmoid function** to the linear output, 
$$
\sigma(x) = \frac{1}{1+e^{-x}}.
$$

Activation functions are one of the **key** components in modern deep learning models.
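To get a feel for how the BCE behaves, the sketch below evaluates the per-sample loss for a point whose true label is $y = 1$ at three hypothetical linear outputs $z = \mathbf{w}^\top\phi(\mathbf{x})$. The helpers `logistic` and `bce_single` are illustrative names, written out explicitly rather than taken from a library.

```python
import numpy as np

def logistic(z):
    """Sigmoid written out explicitly (illustrative helper)."""
    return 1.0 / (1.0 + np.exp(-z))

def bce_single(y_true, p):
    """BCE contribution of one sample with label y_true in {0, 1}."""
    return -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Hypothetical linear outputs z = w^T phi(x) for a sample whose true label is 1.
for z in [4.0, 0.0, -4.0]:
    p = logistic(z)
    print(f"z={z:+.1f}  p={p:.3f}  loss={bce_single(1, p):.3f}")
```

A confident correct prediction ($z = 4$) gives a loss close to zero, an undecided prediction ($z = 0$, $p = 0.5$) gives $\log 2$, and a confident wrong prediction ($z = -4$) is penalized much more heavily. Unlike the misclassification count, this loss varies smoothly with $z$.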

In [None]:
from scipy.special import expit as sigmoid

x_grid = np.linspace(-10, 10, 200)
sigma_vals = sigmoid(x_grid)  # separate name so the dataset labels `y` are not overwritten

plt.figure(figsize=(8, 4))
plt.plot(x_grid, sigma_vals, color='k', lw=2)
plt.text(6., 0.9, "Class 2", fontsize=16)
plt.text(-7.5, .05, "Class 1", fontsize=16)
plt.title("Sigmoid function", fontsize=16)
plt.ylabel(rf"$\sigma(x)$", fontsize=16)
plt.xlabel("x", fontsize=16)

## Gradient of the Loss (Single Sample)

For a single data point $(\mathbf{x}, y)$, define
$$
z = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}),
\qquad
p = \sigma(z),
\qquad
\sigma(z) = \frac{1}{1 + e^{-z}}.
$$

The binary cross-entropy loss is
$$
\mathcal{L}(\mathbf{w})
= -\bigl[
y \log p
+ (1-y)\log(1-p)
\bigr].
$$

Using the chain rule,
$$
\frac{\partial \mathcal{L}}{\partial \mathbf{w}}
=
\frac{\partial \mathcal{L}}{\partial p}
\frac{\partial p}{\partial z}
\frac{\partial z}{\partial \mathbf{w}}.
$$

Each term is given by
$$
\frac{\partial \mathcal{L}}{\partial p}
= -\left(
\frac{y}{p} - \frac{1-y}{1-p}
\right),
$$
$$
\frac{\partial p}{\partial z}
= p(1-p),
$$
$$
\frac{\partial z}{\partial \mathbf{w}}
= \boldsymbol{\phi}(\mathbf{x}).
$$

Combining terms yields
$$
\frac{\partial \mathcal{L}}{\partial \mathbf{w}}
= \bigl(p - y\bigr)\,\boldsymbol{\phi}(\mathbf{x}).
$$

The full gradient over $N$ data points is simply the average of the single-sample gradients, 
$$
\nabla_{\mathbf{w}} \mathcal{L}_{\text{BCE}}
= \frac{1}{N}\sum_{i=1}^N
\bigl(\sigma(z_i) - y_i\bigr)\,
\boldsymbol{\phi}(\mathbf{x}_i),
\qquad
z_i = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_i).
$$
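One way to check the derivation is to compare the analytic gradient with central finite differences. The sketch below does this on hypothetical random data; the names `sigmoid_fn`, `bce_value`, and `bce_gradient`, as well as the data itself, are assumptions made for this check.

```python
import numpy as np

def sigmoid_fn(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_value(w, Phi, y):
    """Mean binary cross-entropy for weights w."""
    p = sigmoid_fn(Phi @ w)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def bce_gradient(w, Phi, y):
    """Analytic gradient: (1/N) * Phi^T (sigma(z) - y)."""
    return Phi.T @ (sigmoid_fn(Phi @ w) - y) / Phi.shape[0]

rng = np.random.RandomState(0)
Phi = np.column_stack([rng.randn(20, 2), np.ones(20)])  # hypothetical features + bias
y = (rng.rand(20) > 0.5).astype(float)                  # hypothetical 0/1 labels
w = 0.1 * rng.randn(3)

# Central finite differences, one coordinate at a time.
h = 1e-6
num_grad = np.zeros_like(w)
for k in range(w.size):
    e = np.zeros_like(w)
    e[k] = h
    num_grad[k] = (bce_value(w + e, Phi, y) - bce_value(w - e, Phi, y)) / (2 * h)

print("max abs difference:", np.max(np.abs(num_grad - bce_gradient(w, Phi, y))))
```

If the derivation is right, the two gradients should agree up to the finite-difference error, which is small for this step size.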

In [None]:
# Binary cross-entropy (BCE) optimization with a single gradient step

def bce_loss(y_true, y_prob, eps=1e-12):
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

def bce_grad_step(Phi_X, y, w, lr=0.1):
    # Single gradient descent step for logistic regression
    z = Phi_X @ w                 # linear outputs, shape (N, 1)
    y_prob = sigmoid(z)           # predicted probabilities
    error = y_prob - y            # (p - y); y must have shape (N, 1)
    grad_w = Phi_X.T @ error / Phi_X.shape[0]
    w = w - lr * grad_w # gradient step
    return w

In [None]:
epochs = 100
X = dataset[0]
y = dataset[1]
y = y.reshape(-1, 1)

use_bias = True

if use_bias:
    X = np.hstack((X, np.ones((X.shape[0], 1))))

# initialize weights

# w = np.zeros((X.shape[1],1))
seed = 0
rng = np.random.RandomState(seed)
w = rng.randn(X.shape[1], 1)

bce_ = []
w_ = []
for epoch in range(epochs):
    w = bce_grad_step(X, y, w, lr=0.1)
    y_prob = sigmoid(np.dot(X, w))
    loss = bce_loss(y, y_prob)
    bce_.append(loss)
    w_.append(w.copy())
    if epoch % 10 == 0:
        print(f"Epoch {epoch}: BCE Loss={loss:.4f}")

# plot BCE
plt.figure(figsize=(8, 4))
plt.plot(range(len(bce_)), bce_, marker="o", color="k")
plt.xlabel("Epoch", fontsize=14)
plt.ylabel("Binary Cross-Entropy Loss", fontsize=14)
plt.grid(True, alpha=0.2)

In [None]:
# Animation using precomputed weights w_ (bias is included in X)

# w_ is a list with one weight array per step, each of shape (n_features, 1)

fig, ax = plt.subplots(figsize=(8, 4))
ax.scatter(X[:, 0], X[:, 1], c=y, s=30, cmap=cm_bright, edgecolors="k")

x_min, x_max = X[:, 0].min() * 1.1, X[:, 0].max() * 1.1
y_min, y_max = X[:, 1].min() * 0.9, X[:, 1].max() * 1.1
ax.set_xlim(x_min, x_max)
ax.set_ylim(y_min, y_max)

x1 = np.linspace(x_min, x_max, 200)
x2 = np.linspace(y_min, y_max, 200)
x1_x2 = np.meshgrid(x1, x2)
X12_grid = np.c_[x1_x2[0].ravel(), x1_x2[1].ravel()]

line, = ax.plot([], [], "k-", linewidth=2)
contour = None

def line_from_w(w):
    if use_bias:
        w0, w1, b = w.flatten()
    else:
        w0, w1 = w.flatten()
        b = 0.0
    if abs(w1) < 1e-12:
        x_vert = -b / w0 if abs(w0) > 1e-12 else 0.0
        return np.array([x_vert, x_vert]), np.array([y_min, y_max])
    x_line = np.array([x_min, x_max])
    y_line = -(w0 * x_line + b) / w1
    return x_line, y_line

def decision_probabilities(grid, w, use_bias):
    if use_bias:
        grid_aug = np.hstack((grid, np.ones((grid.shape[0], 1))))
        scores = np.dot(grid_aug, w)
    else:
        scores = np.dot(grid, w)
    return sigmoid(scores)

def init():
    line.set_data([], [])
    return (line,)

def update(frame):
    global contour
    w_frame = np.asarray(w_[frame])
    bce_value = bce_[frame]
    x_line, y_line = line_from_w(w_frame)
    line.set_data(x_line, y_line)

    prob_output = decision_probabilities(X12_grid, w_frame, use_bias)
    if contour is not None:
        contour.remove()
    contour = ax.contourf(
        x1_x2[0],
        x1_x2[1],
        prob_output.reshape(x1_x2[0].shape),
        levels=[0.0, 0.5, 1.0],
        alpha=0.2,
        colors=["red", "blue"],
    )

    ax.set_title(f"BCE probability surface - step {frame}, BCE={bce_value:.4f}", fontsize=14)
    return (line,)

ani_bce = animation.FuncAnimation(
    fig, update, frames=len(w_), init_func=init, interval=300, blit=False
)
HTML(ani_bce.to_jshtml())
