# Perceptron - Complete Guide

## From Neural Network Foundations to Implementation

The Perceptron is the **simplest neural network**: a single-layer linear binary classifier. It is one of the building blocks from which modern deep learning developed.

### What You'll Learn
1. Perceptron model and step activation
2. Learning rule and convergence
3. Linear separability
4. The XOR problem
5. Implementation from scratch and with scikit-learn
6. Multi-layer perceptron (MLP)
7. Perceptron vs MLP comparison
8. Activation functions

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification, make_circles, load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Perceptron
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

plt.style.use('seaborn-v0_8-whitegrid')
np.random.seed(42)

## 1. Perceptron Model

### Mathematical Model:
$$y = f(\mathbf{w}^T\mathbf{x} + b)$$

Where:
- $\mathbf{x}$ = input features
- $\mathbf{w}$ = weights
- $b$ = bias
- $f$ = activation function (step function)

### Step Activation Function:
$$f(z) = \begin{cases} 1 & \text{if } z \geq 0 \\ 0 & \text{otherwise} \end{cases}$$
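
As a small illustration of the model above, the following sketch evaluates $f(\mathbf{w}^T\mathbf{x} + b)$ on the four inputs of a logical AND gate. The weights `w_and = [1, 1]` and bias `b_and = -1.5` are hand-picked assumptions for this example, not values learned from data.

```python
import numpy as np

def step(z):
    """Step activation: 1 where z >= 0, else 0."""
    return np.where(z >= 0, 1, 0)

# Hand-picked (assumed) parameters that happen to implement logical AND
w_and = np.array([1.0, 1.0])
b_and = -1.5

X_and = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
z_and = X_and @ w_and + b_and   # weighted sum plus bias for each input row
y_and = step(z_and)             # apply the step activation

print("z =", z_and)
print("y =", y_and)
```

Only the input `[1, 1]` produces a non-negative weighted sum, so it is the only row mapped to class 1.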

In [None]:
# Visualize perceptron architecture
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Step function
z = np.linspace(-5, 5, 100)
step = np.where(z >= 0, 1, 0)

axes[0].plot(z, step, 'b-', linewidth=3)
axes[0].axhline(y=0.5, color='r', linestyle='--', alpha=0.5)
axes[0].axvline(x=0, color='g', linestyle='--', alpha=0.5)
axes[0].set_xlabel('z = w·x + b', fontsize=12)
axes[0].set_ylabel('Output', fontsize=12)
axes[0].set_title('Step Activation Function', fontsize=14)
axes[0].grid(True, alpha=0.3)
axes[0].set_ylim(-0.1, 1.1)

# Perceptron diagram
axes[1].text(0.5, 0.9, 'Perceptron Architecture', ha='center', fontsize=16, fontweight='bold')
axes[1].text(0.15, 0.7, 'x₁', ha='center', fontsize=12, bbox=dict(boxstyle='circle', facecolor='lightblue'))
axes[1].text(0.15, 0.5, 'x₂', ha='center', fontsize=12, bbox=dict(boxstyle='circle', facecolor='lightblue'))
axes[1].text(0.15, 0.3, 'x₃', ha='center', fontsize=12, bbox=dict(boxstyle='circle', facecolor='lightblue'))
axes[1].text(0.5, 0.5, '∑', ha='center', fontsize=20, bbox=dict(boxstyle='circle', facecolor='yellow'))
axes[1].text(0.75, 0.5, 'f', ha='center', fontsize=14, bbox=dict(boxstyle='circle', facecolor='lightgreen'))
axes[1].text(0.9, 0.5, 'y', ha='center', fontsize=12, bbox=dict(boxstyle='circle', facecolor='coral'))

# Arrows
for y_pos, label in zip([0.7, 0.5, 0.3], ['w₁', 'w₂', 'w₃']):
    axes[1].arrow(0.2, y_pos, 0.25, 0.5-y_pos, head_width=0.02, head_length=0.03, fc='black')
    axes[1].text(0.3, (y_pos + 0.5)/2, label, fontsize=10, color='red')

axes[1].arrow(0.55, 0.5, 0.15, 0, head_width=0.03, head_length=0.03, fc='black')
axes[1].arrow(0.8, 0.5, 0.05, 0, head_width=0.03, head_length=0.03, fc='black')
axes[1].text(0.5, 0.2, '+ b (bias)', fontsize=10, color='blue')
axes[1].set_xlim(0, 1)
axes[1].set_ylim(0, 1)
axes[1].axis('off')

plt.tight_layout()
plt.show()

## 2. Perceptron Learning Rule

### Update Rule:
$$w_i \leftarrow w_i + \eta(y_{true} - y_{pred})x_i$$
$$b \leftarrow b + \eta(y_{true} - y_{pred})$$

Where $\eta$ is the learning rate.
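
To make the update rule concrete, here is a minimal sketch of a single update on one sample. The helper `perceptron_step` and the sample values are assumptions chosen for illustration; they are not part of the class defined in the next cell.

```python
import numpy as np

def perceptron_step(w, b, x, y_true, eta):
    """Apply one perceptron update and return (new_w, new_b, y_pred)."""
    y_pred = 1 if np.dot(w, x) + b >= 0 else 0
    update = eta * (y_true - y_pred)   # zero when the prediction is correct
    return w + update * x, b + update, y_pred

# Assumed starting point: zero weights, one sample with true label 0
w0 = np.zeros(2)
b0 = 0.0
x_sample = np.array([2.0, 1.0])

w1, b1, first_pred = perceptron_step(w0, b0, x_sample, y_true=0, eta=0.1)
_, _, second_pred = perceptron_step(w1, b1, x_sample, y_true=0, eta=0.1)

print("prediction before update:", first_pred)
print("weights after update:", w1, "bias after update:", b1)
print("prediction after update:", second_pred)
```

With zero weights the weighted sum is 0, so the step function predicts 1 and the sample is misclassified. The update moves the weights away from `x_sample`, and the second call classifies the same sample correctly without changing anything.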

In [None]:
class PerceptronScratch:
    """Perceptron implementation from scratch"""
    
    def __init__(self, learning_rate=0.01, n_iterations=1000):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.weights = None
        self.bias = None
        self.errors = []
        
    def _activation(self, x):
        """Step activation function"""
        return np.where(x >= 0, 1, 0)
    
    def fit(self, X, y):
        """Train the perceptron"""
        n_samples, n_features = X.shape
        
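        # Initialize weights and bias; reset the error history so that
        # calling fit() again does not append to a previous run
        self.weights = np.zeros(n_features)
        self.bias = 0.0
        self.errors = []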
        
        # Convert labels to 0 and 1 if necessary
        y_ = np.where(y <= 0, 0, 1)
        
        # Training
        for iteration in range(self.n_iterations):
            errors = 0
            
            for idx, x_i in enumerate(X):
                # Calculate prediction
                linear_output = np.dot(x_i, self.weights) + self.bias
                y_pred = self._activation(linear_output)
                
                # Update weights and bias
                update = self.learning_rate * (y_[idx] - y_pred)
                self.weights += update * x_i
                self.bias += update
                
                # Count errors
                errors += int(update != 0.0)
            
            self.errors.append(errors)
            
            # Early stopping if converged
            if errors == 0:
                print(f"Converged after {iteration + 1} epochs")
                break
        
        return self
    
    def predict(self, X):
        """Predict class labels"""
        linear_output = np.dot(X, self.weights) + self.bias
        return self._activation(linear_output)
    
    def score(self, X, y):
        """Calculate accuracy"""
        y_pred = self.predict(X)
        y_ = np.where(y <= 0, 0, 1)
        return np.mean(y_pred == y_)

# Generate linearly separable data
X, y = make_classification(n_samples=100, n_features=2, n_redundant=0,
                          n_informative=2, n_clusters_per_class=1,
                          flip_y=0, class_sep=2, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train perceptron
perceptron = PerceptronScratch(learning_rate=0.1, n_iterations=100)
perceptron.fit(X_train_scaled, y_train)

print(f"\nTraining Accuracy: {perceptron.score(X_train_scaled, y_train):.4f}")
print(f"Test Accuracy: {perceptron.score(X_test_scaled, y_test):.4f}")
print(f"Learned Weights: {perceptron.weights}")
print(f"Learned Bias: {perceptron.bias:.4f}")

In [None]:
# Visualize training progress and decision boundary
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Training errors
axes[0].plot(range(1, len(perceptron.errors) + 1), perceptron.errors, 'b-', linewidth=2)
axes[0].set_xlabel('Epoch', fontsize=12)
axes[0].set_ylabel('Number of Misclassifications', fontsize=12)
axes[0].set_title('Perceptron Learning Curve', fontsize=14)
axes[0].grid(True, alpha=0.3)

# Decision boundary
h = 0.02
x_min, x_max = X_train_scaled[:, 0].min() - 1, X_train_scaled[:, 0].max() + 1
y_min, y_max = X_train_scaled[:, 1].min() - 1, X_train_scaled[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

Z = perceptron.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

axes[1].contourf(xx, yy, Z, alpha=0.4, cmap='RdYlBu')
axes[1].scatter(X_train_scaled[:, 0], X_train_scaled[:, 1], c=y_train,
               cmap='RdYlBu', edgecolors='black', s=100)
axes[1].set_xlabel('Feature 1', fontsize=12)
axes[1].set_ylabel('Feature 2', fontsize=12)
axes[1].set_title('Decision Boundary', fontsize=14)

plt.tight_layout()
plt.show()

## 3. Linear Separability

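**Key Limitation**: A single perceptron can only learn **linearly separable** patterns, that is, classes that can be split by one straight line (a hyperplane in higher dimensions).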

### Perceptron Convergence Theorem:
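If the data is linearly separable, the perceptron learning rule is **guaranteed to converge** after a finite number of weight updates. If the data is not separable, the rule never stops making updates.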
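
The classical form of this theorem (due to Novikoff) bounds the number of updates by $(R/\gamma)^2$, where $R$ is the largest input norm and $\gamma$ is the margin of some unit-norm separator. The sketch below checks that bound empirically. It is an illustration under stated assumptions: the separator `w_star`, the margin threshold of 0.2, and the use of labels in $\{-1, +1\}$ with the bias folded into the weight vector are all choices made for this example, and they differ slightly from the 0/1 labels used elsewhere in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed "true" separator; the third entry acts as the bias term
w_star = np.array([2.0, -1.0, 0.5])
w_star /= np.linalg.norm(w_star)

# Random 2-D points with a constant 1 appended for the bias
pts = rng.uniform(-1, 1, size=(500, 2))
X_aug = np.hstack([pts, np.ones((len(pts), 1))])

# Keep only points at least 0.2 away from the separator to enforce a margin
X_aug = X_aug[np.abs(X_aug @ w_star) >= 0.2]
y_pm = np.sign(X_aug @ w_star)            # labels in {-1, +1}

R = np.linalg.norm(X_aug, axis=1).max()   # radius of the data
gamma = np.min(y_pm * (X_aug @ w_star))   # margin with respect to w_star
bound = (R / gamma) ** 2

w = np.zeros(3)
mistakes = 0
converged = False
for epoch in range(1000):
    errors_this_epoch = 0
    for x_i, y_i in zip(X_aug, y_pm):
        if y_i * (w @ x_i) <= 0:          # misclassified (or on the boundary)
            w += y_i * x_i                # perceptron update in +/-1 form
            mistakes += 1
            errors_this_epoch += 1
    if errors_this_epoch == 0:
        converged = True
        break

print(f"updates made: {mistakes}, bound (R/gamma)^2: {bound:.1f}, converged: {converged}")
```

The bound depends only on the geometry of the data (radius and margin), not on the number of samples, which is why a larger margin tends to mean faster convergence.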

In [None]:
# Demonstrate linearly separable vs non-separable
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Dataset 1: Linearly separable
X1, y1 = make_classification(n_samples=100, n_features=2, n_redundant=0,
                            n_informative=2, n_clusters_per_class=1,
                            flip_y=0, class_sep=2, random_state=42)
axes[0].scatter(X1[:, 0], X1[:, 1], c=y1, cmap='RdYlBu', edgecolors='black', s=100)
axes[0].set_title('Linearly Separable\n(Perceptron will converge)', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Feature 1')
axes[0].set_ylabel('Feature 2')

# Dataset 2: XOR (not linearly separable)
X2 = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y2 = np.array([0, 1, 1, 0])  # XOR
axes[1].scatter(X2[:, 0], X2[:, 1], c=y2, cmap='RdYlBu', edgecolors='black', s=300)
axes[1].set_title('XOR Problem\n(NOT Linearly Separable)', fontsize=12, fontweight='bold', color='red')
axes[1].set_xlabel('Feature 1')
axes[1].set_ylabel('Feature 2')
axes[1].set_xlim(-0.5, 1.5)
axes[1].set_ylim(-0.5, 1.5)

# Dataset 3: Circles (not linearly separable)
X3, y3 = make_circles(n_samples=100, noise=0.1, factor=0.5, random_state=42)
axes[2].scatter(X3[:, 0], X3[:, 1], c=y3, cmap='RdYlBu', edgecolors='black', s=100)
axes[2].set_title('Concentric Circles\n(NOT Linearly Separable)', fontsize=12, fontweight='bold', color='red')
axes[2].set_xlabel('Feature 1')
axes[2].set_ylabel('Feature 2')

plt.tight_layout()
plt.show()

## 4. The XOR Problem

The XOR problem is historically significant: it exposed the limits of single-layer perceptrons, a point made prominent by Minsky and Papert's 1969 book *Perceptrons*, and it motivated the development of **multi-layer neural networks**.
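
One way to make this limitation concrete is a brute-force search over many candidate weight settings. The sketch below is not a proof, and the search grid (values from -2 to 2 in steps of 0.1) is an arbitrary assumption. It compares the best accuracy a single step-activated unit reaches on AND with the best it reaches on XOR.

```python
import numpy as np

X_gate = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_and_gate = np.array([0, 0, 0, 1])
y_xor_gate = np.array([0, 1, 1, 0])

# Every (w1, w2, b) combination on an assumed grid of 41 values per parameter
grid = np.linspace(-2, 2, 41)
W1, W2, B = np.meshgrid(grid, grid, grid, indexing="ij")
params = np.stack([W1.ravel(), W2.ravel(), B.ravel()], axis=1)

def best_accuracy(y):
    """Best accuracy over all grid settings for a single step-activated unit."""
    z = params[:, :2] @ X_gate.T + params[:, 2:3]   # shape (n_settings, 4)
    preds = (z >= 0).astype(int)
    return (preds == y).mean(axis=1).max()

best_and = best_accuracy(y_and_gate)
best_xor = best_accuracy(y_xor_gate)

print("best accuracy on AND:", best_and)
print("best accuracy on XOR:", best_xor)
```

No setting on the grid classifies all four XOR points correctly, while AND is solved exactly. A finer or wider grid would not change the XOR result, because no single line separates the two XOR classes.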

In [None]:
# Try to learn XOR with single perceptron (will fail)
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])

perceptron_xor = PerceptronScratch(learning_rate=0.1, n_iterations=1000)
perceptron_xor.fit(X_xor, y_xor)

print("Single Perceptron on XOR:")
print(f"Accuracy: {perceptron_xor.score(X_xor, y_xor):.4f}")
print(f"Predictions: {perceptron_xor.predict(X_xor)}")
print(f"True labels: {y_xor}")
print(f"\nErrors in the last 10 epochs: {perceptron_xor.errors[-10:]}")
print("\n⚠️ Single perceptron CANNOT learn XOR!")

## 5. Scikit-learn Perceptron

In [None]:
# Using sklearn's Perceptron
iris = load_iris()
X_iris = iris.data[:100, :2]  # Use only 2 classes and 2 features
y_iris = iris.target[:100]

X_train_i, X_test_i, y_train_i, y_test_i = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=42
)

scaler_i = StandardScaler()
X_train_i_scaled = scaler_i.fit_transform(X_train_i)
X_test_i_scaled = scaler_i.transform(X_test_i)

# Train sklearn perceptron
sklearn_perceptron = Perceptron(max_iter=1000, tol=1e-3, random_state=42)
sklearn_perceptron.fit(X_train_i_scaled, y_train_i)

y_pred_i = sklearn_perceptron.predict(X_test_i_scaled)

print(f"Sklearn Perceptron Accuracy: {accuracy_score(y_test_i, y_pred_i):.4f}")
print(f"\nWeights: {sklearn_perceptron.coef_[0]}")
print(f"Bias: {sklearn_perceptron.intercept_[0]:.4f}")
print(f"Number of iterations: {sklearn_perceptron.n_iter_}")

## 6. Multi-Layer Perceptron (MLP)

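**Solution to XOR**: Add hidden layers!

An MLP stacks layers of units with non-linear activations between them, which lets it learn non-linear decision boundaries. The non-linearity matters: stacking purely linear layers would still give a linear classifier.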
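
Before training a network, it can help to see that a hidden layer is enough in principle. The sketch below wires a two-layer network by hand: one hidden unit computes OR, the other computes NAND, and the output unit combines them with AND. All weights here are hand-picked assumptions for illustration, not values produced by `MLPClassifier`.

```python
import numpy as np

def step(z):
    """Step activation: 1 where z >= 0, else 0."""
    return np.where(z >= 0, 1, 0)

X_in = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hidden layer: column 0 implements OR, column 1 implements NAND
W_hidden = np.array([[1.0, -1.0],
                     [1.0, -1.0]])
b_hidden = np.array([-0.5, 1.5])

# Output unit: AND of the two hidden activations
w_out = np.array([1.0, 1.0])
b_out = -1.5

hidden = step(X_in @ W_hidden + b_hidden)   # shape (4, 2)
y_hat = step(hidden @ w_out + b_out)        # shape (4,)

print("hidden activations:\n", hidden)
print("network output:", y_hat)
```

The hidden layer remaps the four inputs so that the two classes become linearly separable in the hidden space, and the output unit then only has to draw a single line.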

In [None]:
# Solve XOR with MLP
mlp = MLPClassifier(hidden_layer_sizes=(4,), activation='relu', 
                    max_iter=10000, random_state=42)
mlp.fit(X_xor, y_xor)

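xor_acc = mlp.score(X_xor, y_xor)
print("Multi-Layer Perceptron on XOR:")
print(f"Accuracy: {xor_acc:.4f}")
print(f"Predictions: {mlp.predict(X_xor)}")
print(f"True labels: {y_xor}")
# Training on only four points can stall, so check before claiming success
if xor_acc == 1.0:
    print("\n✓ MLP successfully learns XOR!")
else:
    print("\nMLP did not fit XOR with this seed; try another random_state or solver='lbfgs'")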

# Visualize MLP decision boundary
h = 0.01
x_min, x_max = -0.5, 1.5
y_min, y_max = -0.5, 1.5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

Z_mlp = mlp.predict(np.c_[xx.ravel(), yy.ravel()])
Z_mlp = Z_mlp.reshape(xx.shape)

plt.figure(figsize=(10, 6))
plt.contourf(xx, yy, Z_mlp, alpha=0.4, cmap='RdYlBu')
plt.scatter(X_xor[:, 0], X_xor[:, 1], c=y_xor, cmap='RdYlBu', 
           edgecolors='black', s=300, linewidth=2)
plt.xlabel('Input 1', fontsize=12)
plt.ylabel('Input 2', fontsize=12)
plt.title('MLP Solution to XOR Problem', fontsize=14, fontweight='bold')
plt.colorbar(label='Class')
plt.show()

## 7. Comparison: Perceptron vs MLP

In [None]:
# Generate various datasets
datasets = [
    make_classification(n_samples=100, n_features=2, n_redundant=0, n_informative=2,
                       n_clusters_per_class=1, flip_y=0, class_sep=2, random_state=42),
    make_circles(n_samples=100, noise=0.1, factor=0.5, random_state=42),
    (X_xor, y_xor)
]

titles = ['Linearly Separable', 'Circles', 'XOR']

fig, axes = plt.subplots(2, 3, figsize=(16, 10))

for idx, (X_data, y_data) in enumerate(datasets):
    # Train both models
    perc = PerceptronScratch(learning_rate=0.1, n_iterations=1000)
    perc.fit(X_data, y_data)
    
    mlp_model = MLPClassifier(hidden_layer_sizes=(10,), max_iter=10000, random_state=42)
    mlp_model.fit(X_data, y_data)
    
    # Create mesh
    h = 0.02
    x_min, x_max = X_data[:, 0].min() - 0.5, X_data[:, 0].max() + 0.5
    y_min, y_max = X_data[:, 1].min() - 0.5, X_data[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    
    # Perceptron
    Z_p = perc.predict(np.c_[xx.ravel(), yy.ravel()])
    Z_p = Z_p.reshape(xx.shape)
    
    axes[0, idx].contourf(xx, yy, Z_p, alpha=0.4, cmap='RdYlBu')
    axes[0, idx].scatter(X_data[:, 0], X_data[:, 1], c=y_data, cmap='RdYlBu',
                        edgecolors='black', s=50)
    axes[0, idx].set_title(f'Perceptron: {titles[idx]}\nAcc={perc.score(X_data, y_data):.2f}',
                          fontsize=12, fontweight='bold')
    
    # MLP
    Z_m = mlp_model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z_m = Z_m.reshape(xx.shape)
    
    axes[1, idx].contourf(xx, yy, Z_m, alpha=0.4, cmap='RdYlBu')
    axes[1, idx].scatter(X_data[:, 0], X_data[:, 1], c=y_data, cmap='RdYlBu',
                        edgecolors='black', s=50)
    axes[1, idx].set_title(f'MLP: {titles[idx]}\nAcc={mlp_model.score(X_data, y_data):.2f}',
                          fontsize=12, fontweight='bold')

plt.tight_layout()
plt.show()

## 8. Activation Functions Comparison

In [None]:
# Compare different activation functions
z = np.linspace(-5, 5, 100)

# Activation functions
step = np.where(z >= 0, 1, 0)
sigmoid = 1 / (1 + np.exp(-z))
tanh = np.tanh(z)
relu = np.maximum(0, z)

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.ravel()

activations = [step, sigmoid, tanh, relu]
names = ['Step (Perceptron)', 'Sigmoid', 'Tanh', 'ReLU']
colors = ['blue', 'green', 'red', 'purple']

for idx, (activation, name, color) in enumerate(zip(activations, names, colors)):
    axes[idx].plot(z, activation, color=color, linewidth=3)
    axes[idx].axhline(0, color='gray', linestyle='--', alpha=0.5)
    axes[idx].axvline(0, color='gray', linestyle='--', alpha=0.5)
    axes[idx].set_xlabel('z', fontsize=12)
    axes[idx].set_ylabel('f(z)', fontsize=12)
    axes[idx].set_title(name, fontsize=14, fontweight='bold')
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Activation Functions:")
print("- Step: Used in the classic perceptron (gradient is zero almost everywhere, so no gradient-based training)")
print("- Sigmoid: Smooth, outputs in (0, 1), used for binary classification")
print("- Tanh: Smooth, outputs in (-1, 1), zero-centered")
print("- ReLU: Cheap to compute, widely used in deep networks")

## Summary

### Key Takeaways

1. **Perceptron**: Simplest neural network, linear binary classifier
2. **Learning Rule**: Updates weights based on misclassifications
3. **Convergence**: Guaranteed for linearly separable data
4. **Limitation**: Cannot learn non-linear patterns (XOR problem)
5. **Solution**: Multi-layer perceptron with hidden layers
6. **Historical Importance**: Foundation of deep learning

### Pros and Cons

**Pros:**
- Simple and fast
- Online learning capable
- Few hyperparameters (learning rate and number of epochs)
- Guaranteed convergence for linearly separable data
- Foundation for understanding neural networks

**Cons:**
- Only works for linearly separable data
- No probability outputs (hard predictions)
- Sensitive to feature scaling
- Can oscillate without convergence for non-separable data
- Largely superseded in practice by logistic regression, SVMs, and multi-layer networks
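
To illustrate the point about hard predictions: scikit-learn's `Perceptron` provides `decision_function`, which returns a signed score for each sample, but that score is not a calibrated probability. The sketch below uses assumed toy data (two Gaussian blobs) to show how the predicted labels follow from the sign of the score.

```python
import numpy as np
from sklearn.linear_model import Perceptron

rng = np.random.default_rng(0)

# Toy data assumed for this example: two well-separated Gaussian blobs
X_toy = np.vstack([rng.normal(-2, 0.5, size=(20, 2)),
                   rng.normal(2, 0.5, size=(20, 2))])
y_toy = np.array([0] * 20 + [1] * 20)

clf = Perceptron(random_state=0).fit(X_toy, y_toy)

scores = clf.decision_function(X_toy)   # signed scores w.x + b
labels = clf.predict(X_toy)             # hard 0/1 predictions

print("first five scores:", np.round(scores[:5], 3))
print("first five labels:", labels[:5])
```

If probabilities are needed, a model with a probabilistic output such as logistic regression is usually a better fit than thresholding these scores.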

### When to Use Perceptron

**Use when:**
- Data is linearly separable
- Need fast, simple baseline
- Online learning required
- Teaching/understanding neural networks

**Avoid when:**
- The classes are not linearly separable (a non-linear decision boundary is needed)
- Need probability outputs

### Practice Problems

1. Implement perceptron with different learning rates
2. Modify to use sigmoid activation (logistic regression)
3. Build a 2-layer network to solve XOR from scratch
4. Compare convergence speed on different datasets