**Author:** Shahab Fatemi

**Email:** shahab.fatemi@umu.se   ;   shahab.fatemi@amitiscode.com

**Created:** 2024-11-10

**Last update:** 2025-09-14

**MIT License** — Shahab Fatemi (2025); For use in the *Machine Learning in Physics* course, Umeå University, Sweden; See the full license text in the parent folder.

<hr>

📢 <span style="color:red"><strong> Note for Students:</strong></span>

* Before working on the labs, review your lecture notes.

* Please read all sections, code blocks, and comments **carefully** to fully understand the material. Throughout the labs, my instructions are provided to you in written form, guiding you through the materials step-by-step.

* All concepts covered in this lab are part of the course and may be included in the final exam.

* I strongly encourage you to work in pairs and discuss your findings, observations, and reasoning with each other.

* If something is unclear, don't hesitate to ask.

* Exercise submission is not required; these tasks are designed to help you practice, explore the concepts, and learn by doing.

* I have done my best to make the lab files as bug-free (and error-free) as possible, but remember: *there is no such thing as bug-free code.* If you observe any bugs, errors, typos, or other issues, I would greatly appreciate it if you reported them to me by email. Verbal notifications do not work, as I will likely forget 🙂

ENJOY WORKING ON THIS LAB.
***

# 🛠️ Purpose and Learning Outcomes:

The main focus of this lab is binary classification, using two fundamental methods:
  - Perceptron
  - Logistic Regression

You will also learn about the confusion matrix, a key tool for evaluating classification accuracy and precision.

***

In [None]:
import sys
import os
sys.path.append(os.path.abspath('../utils'))
from notebook_config import *

# Creating a list of colors based on the "tab10" colormap.
# I like the color set in the "tab10" colormap.
cmap = plt.colormaps["tab10"]
colors = [cmap(i) for i in range(21)]

# Different Marker for Scatter plot
markers = ['o', 's', '*', 'x', '^', 'v', '<', '>']  # Different markers for different classes

# Binary Classification

**Overview:** Binary classification aims to categorize (classify) input data into one of two distinct categories (or classes).

In the code section below, I've defined several functions that generate datasets with different geometric patterns and separations for classification. They provide samples of both linearly separable and non-linearly separable data. We will call these functions throughout this notebook, because WE NEED DATA. You will see the outcome of each function in the upcoming code sections. So for now, quickly read what each function does and then run the code below. It should run without errors and produce no output.

In [None]:
from sklearn.datasets import make_classification, make_circles, make_moons, make_blobs

# This function uses sklearn "make_classification" to generate a dataset with 
# two informative features, and one cluster per class.
# I've intentionally used a non-42 rand state here! Do not change it!
def make_regular_data(n_samples=500, rand_state=90):
    x, y = make_classification(n_samples=n_samples, n_features=2, n_redundant=0,
                               n_informative=2, n_clusters_per_class=1, random_state=rand_state)
    return x, y

# This function creates points uniformly distributed in a square and 
# labels them based on whether they lie above or below the line y = x, 
# producing a simple linear decision boundary.
def make_diagonal_data(n_samples=500, random_state=42):
    np.random.seed(random_state)

    # Generate uniform data in range [0, 5]
    x = np.random.uniform(0, 5, size=(n_samples, 2))

    # true labels based on y > x
    y = (x[:, 1] > x[:, 0]).astype(int)
    return x, y

# This function generates a dataset with a non-linear XOR pattern such that points are labeled 
# as class 1 if exactly one of their coordinates is greater than 2.5, and class 0 otherwise. 
# This creates a checkerboard-like separation.
def make_xor_data(n_samples=500, random_state=42):
    np.random.seed(random_state)

    # Generate uniform data in range [0, 5]
    x = np.random.uniform(0, 5, size=(n_samples, 2))

    # XOR condition: label is 1 if one of the coordinates is >2.5 and the other is <=2.5
    y = (((x[:, 0] >  2.5) & (x[:, 1] <= 2.5)) | 
         ((x[:, 0] <= 2.5) & (x[:, 1] >  2.5))).astype(int)
    return x, y

# This function produces a dataset consisting of two noisy concentric circles: 
# an inner and outer ring are labeled as different classes. 
# This is a classic example of a non-linearly separable dataset, useful for testing non-linear classifiers.
def make_concentric_circles(n_samples=500, factor=0.3, noise=0.1):
    x, y = make_circles(n_samples=n_samples, factor=factor, noise=noise, random_state=42)
    return x, y

# Similar to the circle function, this function makes two interleaving half circles.
def make_two_half_moons(n_samples=500, noise=0.2):
    x, y = make_moons(n_samples=n_samples, noise=noise, random_state=42)
    return x, y

# This function generates a dataset with three distinct clusters using Gaussian blobs.
def make_three_classes(n_samples=500, noise=0.7):
    x, y = make_blobs(n_samples=n_samples,
                    centers=[[0, 5], [2, 0], [5, 4]], #[[0, 5], [2, 0], [5, 4], [2, 6]],
                    cluster_std=noise,
                    random_state=42)
    return x, y

Let's:
1. Generate data using a few data generators above,
2. Split the data into training and validation sets, and
3. Visualize the training data to understand its structure.

These are the standard first steps in any machine learning workflow.

In [None]:
# General data visualization function
def visualize_data(x, y, title):
    plt.figure()
    
    classes = np.unique(y) # Get unique class labels

    for i, label in enumerate(classes):
        plt.scatter(
            x[y == label, 0],
            x[y == label, 1],
            marker=markers[i % len(markers)],
            color=colors[i], edgecolor="k" ,
            s=50, alpha=0.7, 
            label=f"Class {label}")

    plt.xlabel("Feature 1 (x1)")
    plt.ylabel("Feature 2 (x2)")
    plt.title(title)
    plt.legend()
    plt.grid(True)
    plt.show()

#### ⚠️ Data splitting
Note that you need to split data into training and validation (test) sets before performing analysis.

In [None]:
from sklearn.model_selection import train_test_split

x, y = make_regular_data()

# Since the data is sparse, I make a larger validation set (30%)
X_train, X_val, y_train, y_val = train_test_split(x, y, 
                                                  test_size=0.3,    # 30% validation
                                                  stratify =y,      # Stratified split
                                                  shuffle  =True,   # Shuffle the data
                                                  random_state=42)

visualize_data(X_train, y_train, "Regular Data (Training Set)")

In [None]:
x, y = make_concentric_circles()

# Since the data is sparse, I make a larger validation set (30%)
X_train, X_val, y_train, y_val = train_test_split(x, y, 
                                                  test_size=0.3,    # 30% validation
                                                  stratify =y,      # Stratified split
                                                  shuffle  =True,   # Shuffle the data
                                                  random_state=42)

visualize_data(X_train, y_train, "Concentric Circles Data (Training Set)")

***
### ✅ Check your understanding

- What does "stratified split" mean in the `train_test_split` function used above, and why is the split based on the "y" values? (See the `stratify=y` argument in the previous code block.)

***

## 🔗 Perceptron Learning

(I assume you have already implemented your Perceptron algorithm, as previously recommended and emphasized in the class.)

**Overview:** Here, we develop a **perceptron** classifier from scratch and apply it to various datasets to explore how a simple linear model learns to separate different classes over time. Some of the datasets generated in the previous code section are linearly separable, while others are not. The perceptron model developed here cannot correctly separate data that is not linearly separable.

### Development

The perceptron is trained on data produced by our dataset generators, with labels (y) converted to $y \in \{-1, 1\}$ to align with the original algorithm's formulation. During training, the perceptron iteratively adjusts its weights and bias whenever it misclassifies a sample. Over multiple epochs, the model learns how to find a separating hyperplane.

The perceptron makes predictions using this rule: $y^* = \text{sign}(\mathbf{w} \cdot \mathbf{x} + b)$

This is exactly what the `predict` function implements in the class below. In the code section below, find the `predict` function and compare its implementation with the mathematical formula above. They should match.

In addition to `predict`, I've written another function named `train`. This is what `train` does:
- If the prediction is **correct**, no change is made.  
- If the prediction is **incorrect**, the weights need to be updated to shift the decision boundary in the *right* direction. 

The update rule that adjusts the weights in the *right* direction is given by $\mathbf{w}_{\text{new}} = \mathbf{w}_{\text{old}} + \eta (y - y^*) \mathbf{x}$ where $\eta$ is the learning rate (also see lecture notes). The bias is updated in the same way: $b_{\text{new}} = b_{\text{old}} + \eta (y - y^*)$.
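
As a small numeric illustration of the update rule above, the sketch below applies the update to a single sample, twice. The helper name `perceptron_step` and the sample values are made up for this illustration (they are not part of the lab code), and the bias is assumed to be updated with the same factor $\eta (y - y^*)$ as the weights.

```python
import numpy as np

def perceptron_step(w, b, x_i, y_i, eta=0.1):
    """Apply one perceptron update for a single sample with label in {-1, +1}.

    Hypothetical helper for illustration only.
    """
    # Prediction y* = sign(w . x + b), with a score of exactly 0 mapped to +1
    y_star = 1 if np.dot(w, x_i) + b >= 0 else -1
    # eta * (y - y*) is 0 for a correct prediction and +/- 2*eta otherwise
    update = eta * (y_i - y_star)
    return w + update * x_i, b + update, y_star

# Start from zero weights and zero bias
w_demo, b_demo = np.zeros(2), 0.0
x_i, y_i = np.array([2.0, -1.0]), -1   # made-up sample with true label -1

# First pass: the score is 0, so y* = +1, which is wrong -> weights and bias move
w_demo, b_demo, y_star = perceptron_step(w_demo, b_demo, x_i, y_i)
print(y_star, w_demo, b_demo)

# Second pass on the same sample: now it is classified correctly -> no change
w_next, b_next, y_star_next = perceptron_step(w_demo, b_demo, x_i, y_i)
print(y_star_next, w_next, b_next)
```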

What I've explained above is directly implemented in the `train` function. Additionally, the `train` function stores the weights and bias at each epoch, together with the *misclassification percentage* at each epoch, for later analysis.

Now that you have done these pre-studies, understanding the code below is simple, because the remaining functions are used to visualize the results. You should also compare the above explanation with the Perceptron Algorithm I explained in the class.

⚠️ NOTE: The model developed below is iterative, and therefore it needs to start with pre-defined values for the weights ($\mathbf{w}$) and the bias ($b$, also denoted $w_0$). Over the iterations, the model updates the weights and bias to find a solution that reduces the classification error.

In [None]:
from IPython.display import clear_output

# Simple Perceptron Class for binary classification
class SimplePerceptron:
    # Initialize the perceptron with training data and hyper-parameters
    def __init__(self, x, y, eta=0.1, epochs=20):
        self.x       = x        # Input features
        self.y       = y        # Target labels
        
        self.eta     = eta      # Learning rate
        self.epochs  = epochs   # Number of epochs
        
        self.w       = np.zeros(x.shape[1])  # Initial Weights (these are required for the iterative approach to begin with)
        self.b       = 0        # Initial Bias (these are required for the iterative approach to begin with)

        # Housekeeping data for tracking and further analysis
        self.coeff_hist  = []       # History of model coefficients (Weight and Bias) over epochs
        self.mis_percent = []       # Percentage of mis-classified samples over epochs

    # This function makes a prediction for input x using the linear 
    #   decision rule, explained in the markdown cell above.
    def predict(self, x):
        return np.where(np.dot(x, self.w) + self.b >= 0, +1, -1)

    # This function trains a simple linear classifier using perceptron algorithm over several iterations. 
    # For each training example, it updates the model's weights and bias if the prediction is incorrect. 
    # It also stores a history for Weight and Bias as well as the errors over epochs.
    def train(self):
        for epoch in range(self.epochs):
            mis_count = 0 # Count misclassifications
            for x_i, y_i in zip(self.x, self.y):  # Iterate over every training example (x_i, y_i)
                y_star = self.predict(x_i)        # Predict the label y^*
                update = self.eta * (y_i - y_star)
                if(update != 0):     # If the prediction is incorrect
                    mis_count += 1   # Increment the misclassification counter
                self.w += update * x_i  # Update weights
                self.b += update        # Update bias
            self.coeff_hist.append((self.w.copy(), self.b)) # Store the history of weights and bias
            self.mis_percent.append(mis_count*100/self.y.shape[0]) # Calculate misclassification percentage and keep its history
            
            # Visualize decision boundary for current epoch
            self.plot_decision_boundary(epoch)

    # Visualize data and decision boundary for the given epoch
    def plot_decision_boundary(self, epoch):
        clear_output(wait=True)
        plt.figure()

        classes = np.unique(self.y) # Get unique class labels

        # Plot training data
        for i, label in enumerate(classes):
            plt.scatter(
                self.x[self.y == label, 0],
                self.x[self.y == label, 1],
                marker=markers[i % len(markers)],
                color=colors[i], edgecolor="k" ,
                s=50, alpha=0.7, 
                label=f"Class {label}")

        # plot decision boundaries
        x_vals = np.linspace(self.x[:, 0].min() - 1, self.x[:, 0].max() + 1, 200)
        for i, (w, b) in enumerate(self.coeff_hist):
            if(w[1] == 0):
                continue
            y_vals = -(w[0] * x_vals + b) / w[1] # Decision boundary equation
            intensity = (i + 1) / self.epochs    # Calculate intensity for color mapping
            color = (1 - intensity, 1 - intensity, 1 - intensity) # Grayscale color mapping
            plt.plot(x_vals, y_vals, color=color, linewidth=2.0, alpha=0.8)
        
        plt.xlabel('Feature 1 (x1)')
        plt.ylabel('Feature 2 (x2)')
        plt.title(f'Perceptron Learning (Epoch {epoch + 1})')
        plt.legend()
        plt.show()

    # plot misclassifications
    def plot_misclassified_history(self):
        plt.figure()
        plt.plot(range(1, self.epochs + 1), self.mis_percent, marker='o', color='forestgreen', linewidth=2, alpha=0.7)
        plt.xlabel('Epoch')
        plt.ylabel('Errors (%)')
        plt.title('Misclassifications history')
        plt.show()
        
    def plot_weights_bias_history(self):
        plt.figure()
        plt.plot(range(1, self.epochs + 1), [wb[0][0] for wb in self.coeff_hist], "-", color=colors[0], linewidth=2, label="w1")
        plt.plot(range(1, self.epochs + 1), [wb[0][1] for wb in self.coeff_hist], "-", color=colors[1], linewidth=2, label="w2")
        plt.plot(range(1, self.epochs + 1), [wb[1] for wb in self.coeff_hist]   , "-", color=colors[2], linewidth=2, label="bias (w0)")
        plt.xlabel('Epoch')
        plt.ylabel('Value')
        plt.title('Weights and Bias history')
        plt.legend()
        plt.show()
        

# ========== MAIN ==========
# Generate data
x, y = make_regular_data()
# x, y = make_diagonal_data()
# x, y = make_xor_data()
# x, y = make_concentric_circles()
y = np.where(y == 0, -1, 1)  # Convert labels to -1 and 1

# Since the data is sparse, I make a larger validation set (30%)
X_train, X_val, y_train, y_val = train_test_split(x, y, 
                                                  test_size=0.3,    # 30% validation
                                                  stratify =y,      # Stratified split
                                                  shuffle  =True,   # Shuffle the data
                                                  random_state=42)

# Run the Simple Perceptron
perceptron = SimplePerceptron(X_train, y_train, eta=0.1, epochs=20)
perceptron.train()


In [None]:
print("Model coefficients:", perceptron.w)
print("Bias (w0):", perceptron.b)
print(rf'Decision Boundary: {perceptron.w[0]:+.2f} x_1 {perceptron.w[1]:+.2f} x_2 {perceptron.b:+.2f} = 0')

Now we want to visualize the misclassification history.

In [None]:
# Now we want to visualize the misclassification history
perceptron.plot_misclassified_history()

And then, we want to plot the evolution of weights and bias over epochs.

In [None]:
# the evolution of weights and bias over epochs
perceptron.plot_weights_bias_history()

***
### ✅ Check your understanding

- Why is $y \in \{-1, +1\}$ used for the Perceptron algorithm? Is this requirement essential?

- After running the code for 20 iterations, do you think the Perceptron has converged to a final solution?

- What exactly is shown in the misclassification figure? Which type of error does it represent?

- Why doesn't the misclassification error reach zero? Would you expect it to reach zero if the number of epochs is increased? Try running the algorithm for more epochs. What do you observe, and why?

- How does the misclassification rate change over the epochs, and what can you conclude from the figure? Does your conclusion align with the weight and bias history plot?

***

### Model evaluation

Let's evaluate our model with evaluation metrics. First, we look at the confusion matrix (or contingency matrix). A confusion matrix is a simple table used to evaluate how well a classification model performs. It compares the model's predictions with the actual outcomes and shows where the model was correct or made mistakes. This breakdown helps identify specific areas for improvement. 
The confusion matrix has four key categories:
- True Positive (TP): The model correctly predicted a positive outcome (and the actual outcome was positive).
- True Negative (TN): The model correctly predicted a negative outcome (and the actual outcome was negative).
- False Positive (FP): The model incorrectly predicted a positive outcome (the actual outcome was negative). This is also called a Type I error.
- False Negative (FN): The model incorrectly predicted a negative outcome (the actual outcome was positive). This is also called a Type II error.

Here is an example of a confusion matrix:
|                          | **Predicted Positive** | **Predicted Negative** |
|--------------------------|------------------------|------------------------|
| **Actual Positive**      | True Positive (TP)     | False Negative (FN)    |
| **Actual Negative**      | False Positive (FP)    | True Negative (TN)     |


⚠️ NOTE: All evaluation metrics need to be performed on the validation (or test) sets.
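
The four counts defined above can be computed directly from the true and predicted labels. The sketch below does this with plain NumPy on made-up labels (the arrays `y_true_demo` and `y_hat_demo` are invented for illustration, and +1 is assumed to be the positive class). Note that scikit-learn's `confusion_matrix` orders rows and columns by sorted label value, so for labels -1 and +1 it returns `[[TN, FP], [FN, TP]]`: the negative class comes first, unlike the table above.

```python
import numpy as np

# Made-up labels for illustration; +1 is treated as the positive class
y_true_demo = np.array([+1, +1, +1, -1, -1, -1, -1, +1])
y_hat_demo  = np.array([+1, -1, +1, -1, +1, -1, -1, +1])

tp = int(np.sum((y_true_demo == +1) & (y_hat_demo == +1)))  # correctly predicted positives
tn = int(np.sum((y_true_demo == -1) & (y_hat_demo == -1)))  # correctly predicted negatives
fp = int(np.sum((y_true_demo == -1) & (y_hat_demo == +1)))  # Type I errors
fn = int(np.sum((y_true_demo == +1) & (y_hat_demo == -1)))  # Type II errors

# Same layout as scikit-learn's confusion_matrix for labels sorted as (-1, +1):
# rows are true labels, columns are predicted labels
matrix_demo = np.array([[tn, fp],
                        [fn, tp]])
print(matrix_demo)

accuracy  = (tp + tn) / (tp + tn + fp + fn)  # fraction of all samples classified correctly
precision = tp / (tp + fp)                   # fraction of predicted positives that are truly positive
print(f"accuracy={accuracy:.2f}, precision={precision:.2f}")
```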

In [None]:
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Make predictions on validation data
y_pred = perceptron.predict(X_val)

# Compute confusion matrix
confusion = confusion_matrix(y_val, y_pred)
labels = ["Class -1", "Class +1"]

# Plot
plt.figure(figsize=(4, 4))
sns.heatmap(confusion, annot=True, fmt="d", cmap="Blues",
            xticklabels=labels, yticklabels=labels,
            cbar=False, linewidths=1, linecolor="black")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix")
plt.show()


***
### ✅ Check your understanding
- What do you see from the Confusion Matrix and what does it tell us about the model performance?

### 💡 Reflect and Run

- Increase the number of epochs from 20 to 100 and run the training again. I recommend opening a new code cell and writing all necessary code there, without modifying the earlier cells. This is because you will need the results from the previous run for comparison. After training the new model, check if it converges to a solution with more iterations. Then, perform a full model evaluation using performance metrics (here, the confusion matrix), and compare your results with those from the earlier model.

- Now generate data using `make_diagonal_data()` and run your code again. What do you observe?

- Now try `make_xor_data()` or `make_concentric_circles()` to generate data and run your code. What do you observe and what does the confusion matrix tell you about the model? Remember, these datasets are not linearly separable, while our Perceptron can only learn a **linear** decision boundary.

***

## Perceptron from sklearn

In addition to implementing our own perceptron from scratch, you can also use `sklearn.linear_model.Perceptron`, which provides an efficient implementation of the perceptron algorithm with some additional features such as regularization and early stopping. Let's use it:

In [None]:
from sklearn.linear_model import Perceptron

x, y = make_regular_data()
y = np.where(y == 0, -1, 1)  # Convert labels to -1 and 1

# Since the data is sparse, I make a larger validation set (30%)
X_train, X_val, y_train, y_val = train_test_split(x, y, 
                                                  test_size=0.3,    # 30% validation
                                                  stratify =y,      # Stratified split
                                                  shuffle  =True,   # Shuffle the data
                                                  random_state=42)

# Create a Perceptron classification model
clf_model = Perceptron(max_iter=20, eta0=0.1)

# Train the model
clf_model.fit(X_train, y_train)

# Get the weights and bias
w = clf_model.coef_
b = clf_model.intercept_
print("Model coefficients:", w)
print("Bias (w0):", b)  
print(rf'Decision Boundary: {w[0,0]:+.2f} x_1 {w[0,1]:+.2f} x_2 {b[0]:+.2f} = 0')

In [None]:
# Make predictions
y_pred = clf_model.predict(X_val)

# Compute confusion matrix
confusion = confusion_matrix(y_val, y_pred)
labels = ["Class -1", "Class +1"]

# Plot
plt.figure(figsize=(4, 4))
sns.heatmap(confusion, annot=True, fmt="d", cmap="Blues",
            xticklabels=labels, yticklabels=labels,
            cbar=False, linewidths=1, linecolor="black")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix")
plt.show()

Compare the decision boundary and the confusion matrix you got from sklearn with those you got earlier from my implementation. Any difference? 🤓

***

## Logistic Regression

**Overview:** Logistic regression is a statistical method for binary classification. Unlike the perceptron, which produces hard binary outputs (e.g., 0 or 1, or -1 or +1) by applying a step function to the weighted sum of inputs, logistic regression applies the sigmoid (logistic) function to that weighted sum to produce a smooth probabilistic output between 0 and 1.

In the code below, we train a logistic regression model, as implemented in scikit-learn, on our regular binary dataset. The model learns a linear decision boundary by fitting a sigmoid function to the data.

⚠️ NOTE: The sigmoid function produces values between 0 and 1 (do you know why?), and therefore, unlike previous sections, we do not need to convert our labels to -1 and +1.
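
To make the sigmoid mapping concrete, here is a minimal sketch of how a linear score $z = \mathbf{w} \cdot \mathbf{x} + b$ is turned into a probability $P(y=1|\mathbf{x}) = 1/(1+e^{-z})$. The weights, bias, and sample points below are made-up values for illustration, not the coefficients of the model trained in this lab.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real score z to a value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients (assumed for illustration only)
w_demo = np.array([1.5, -0.5])
b_demo = 0.25

points_demo = np.array([[ 2.0,  1.0],   # score well above zero
                        [-0.5, -1.0],   # score exactly zero: on the decision boundary
                        [-3.0,  0.5]])  # score well below zero

z_demo = points_demo @ w_demo + b_demo        # linear scores w . x + b
p_demo = sigmoid(z_demo)                      # P(y = 1 | x) for each point
classes_demo = (p_demo >= 0.5).astype(int)    # threshold at 0.5 to get class 0 or 1

for x_i, z_i, p_i, c_i in zip(points_demo, z_demo, p_demo, classes_demo):
    print(x_i, f"z={z_i:+.2f}", f"P={p_i:.3f}", "class", c_i)
```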

In [None]:
from sklearn.linear_model import LogisticRegression

# This is our regular data.
x, y = make_regular_data()

## This line is commented out, and is not needed.
# y = np.where(y == 0, -1, 1)  # Convert labels to -1 and 1

# Since the data is sparse, I make a larger validation set (30%)
X_train, X_val, y_train, y_val = train_test_split(x, y, 
                                                  test_size=0.3,    # 30% validation
                                                  stratify =y,      # Stratified split
                                                  shuffle  =True,   # Shuffle the data
                                                  random_state=42)

# Create a Logistic Regression model
lr_model = LogisticRegression()

# Train the model
lr_model.fit(X_train, y_train)

# print the weights or model coefficients from the classifier
w = lr_model.coef_
b = lr_model.intercept_
print("Model coefficients:", w)
print("Intercept (w0):", b)
print(rf'Decision Boundary: {w[0,0]:+.2f} x_1 {w[0,1]:+.2f} x_2 {b[0]:+.2f} = 0')

***
### ✅ Check your understanding
- Compare the decision boundary obtained from logistic regression with the one learned by the perceptron (both my implementation and the one from sklearn). Are the boundaries similar in shape and orientation? Do they separate the classes in the same way? Should they be similar or not?

- Use the model coefficients and intercept values to theoretically calculate $P(y=1|(x_1=-0.5, x_2=0.0))$. What does that value mean? See your lecture notes on how to calculate $P$.
***

Let's visualize the decision boundary. I have written a function that plots the data and creates a mesh grid covering the input space. For each point on the grid, the trained model predicts the probability of the positive class. These probabilities are visualized using a heatmap. The decision boundary (at 0.5) is highlighted. Additional contour lines at levels like 10%, 30%, 70%, and 90% are also shown to illustrate how the model's confidence varies across the space. 

Since I will be using the visualization function for both probabilistic and non-probabilistic models, I have added a control parameter to enable or disable probability-based visualizations. This allows the same function to handle models that output class probabilities (like logistic regression) as well as those that don't (like the perceptron).

In [None]:
def plot_boundary_decision(x, y, model, probabilistic=True):
    plt.figure()

    classes = np.unique(y)   # Get unique class labels
    n_classes = len(classes)  # Number of classes

    # Plot the data points
    for i, label in enumerate(classes):
        plt.scatter( x[y == label, 0],
                    x[y == label, 1],
                    marker=markers[i % len(markers)],
                    color=colors[i], edgecolor="k",
                    s=50, alpha=0.7,
                    label=f"Class {label}" )

    # create a 100x100 mesh grid for mapping the decision boundary on it
    xx, yy = np.meshgrid(
        np.linspace(x[:, 0].min() - 0.5, x[:, 0].max() + 0.5, 100),
        np.linspace(x[:, 1].min() - 0.5, x[:, 1].max() + 0.5, 100) )
    grid = np.c_[xx.ravel(), yy.ravel()]

    if probabilistic:
        probs = model.predict_proba(grid)  # shape: (num_points, n_classes)
        if probs.shape[1] == 2:
            # Binary classification: take the positive class probability
            probs_display = probs[:, 1].reshape(xx.shape)
        else:
            # Multiclass: take max probability as "confidence" for current decision
            probs_display = np.max(probs, axis=1).reshape(xx.shape)
    else:
        preds = model.predict(grid)
        probs_display = preds.reshape(xx.shape)

    # Plot contour lines
    if probabilistic:
        prob_levels = [0.1, 0.3, 0.7, 0.9]
    else:
        prob_levels = [0.5]

    for p in prob_levels:
        contour = plt.contour(xx, yy, probs_display, levels=[p], linestyles="--", linewidths=1.0)
        plt.clabel(contour, fmt={p: f'{int(p * 100)}%'})

    # Plot decision boundary (0.5 level or class change)
    if probabilistic:
        plt.contour(xx, yy, probs_display, levels=[0.5], colors='k', linewidths=2)
    else:
        plt.contour(xx, yy, probs_display, levels=np.arange(n_classes + 1) - 0.5, colors='k', linewidths=2)

    # Heatmap background
    plt.contourf(xx, yy, probs_display, levels=100, cmap="coolwarm", alpha=0.3)

    plt.xlabel("Feature 1 (x1)")
    plt.ylabel("Feature 2 (x2)")
    plt.title("Decision Boundary")
    plt.legend()
    plt.grid(True)
    plt.show()


## ========== MAIN ==========
plot_boundary_decision(X_train, y_train, lr_model)


***
### ✅ Check your understanding
- Study the figure above. Where is the decision boundary and what are those solid and dashed lines?

- Locate the point $x_1=-0.5, x_2=0.0$ on the decision boundary plot. Compare the predicted probability $P$ you previously calculated at this point using the logistic regression model with the color and contour value shown in the figure.
***

In [None]:
# Make predictions
y_pred = lr_model.predict(X_val)

# Compute confusion matrix
confusion = confusion_matrix(y_val, y_pred)
labels = ["Class 0", "Class 1"]  # Logistic regression was trained on labels 0 and 1

# Plot
plt.figure(figsize=(4, 4))
sns.heatmap(confusion, annot=True, fmt="d", cmap="Blues",
            xticklabels=labels, yticklabels=labels,
            cbar=False, linewidths=1, linecolor="black")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix")
plt.show()

***
### ✅ Check your understanding
- Compare the confusion matrix obtained from logistic regression with the one learned by the perceptron (both my implementation and the one from sklearn). Why is the one from logistic regression different?

### 💡 Reflect and Run
- Read more about `LogisticRegression()` on the [official scikit-learn website](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) to understand the available options for regularization in logistic regression. Modify your code to use both **L1** (Lasso) and **L2** (Ridge) regularization by setting appropriate values for the `penalty`, `solver`, and `C` parameters. Train the logistic regression model separately with each regularization and compare the results with those you got earlier using the default settings. Discuss how regularization influences the model and under what conditions one might be preferred over the other.

- To generate a new dataset, use
```python
x, y = make_regular_data(rand_state=870)
```
Then visualize the data and test your model and its performance.
***


## Non-Linear Classification with Polynomial Perceptron

Let's move back to the `Perceptron` again. This time, we want to apply it to a non-linear classification problem. Wait, what? Yes, to a non-linear problem, and it works, similar to the closed-form solution we applied to polynomial functions. Let's figure it out.

The code below defines a custom classifier that combines polynomial feature expansion with a perceptron to handle non-linear classification tasks. By transforming input data into a higher-dimensional polynomial space, we allow the linear perceptron to learn complex decision boundaries 🤓. The model is trained and evaluated using a pipeline that also standardizes the data.

Here, we use the `Perceptron` class from scikit-learn.
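
To see why a polynomial expansion helps, consider two concentric rings. No straight line in $(x_1, x_2)$ separates them, but the circle $x_1^2 + x_2^2 = r^2$ is a *linear* equation in the expanded features $(1, x_1, x_2, x_1^2, x_1 x_2, x_2^2)$. The sketch below illustrates this with plain NumPy. The helper `expand_degree2`, the radii, and the hand-picked weights are assumptions made for this illustration; in the lab, `PolynomialFeatures` builds the features and the perceptron learns the weights.

```python
import numpy as np

def expand_degree2(x):
    """Map (x1, x2) to the degree-2 features (1, x1, x2, x1^2, x1*x2, x2^2).

    Hypothetical helper for illustration only.
    """
    x1, x2 = x[:, 0], x[:, 1]
    return np.column_stack([np.ones(len(x)), x1, x2, x1**2, x1 * x2, x2**2])

# Two noise-free rings: inner radius 0.3 (class -1), outer radius 1.0 (class +1)
angles = np.linspace(0.0, 2.0 * np.pi, 50, endpoint=False)
unit   = np.column_stack([np.cos(angles), np.sin(angles)])
x_demo = np.vstack([0.3 * unit, 1.0 * unit])
y_demo = np.concatenate([-np.ones(50), np.ones(50)]).astype(int)

# Hand-picked weights encoding x1^2 + x2^2 - 0.65^2 = 0 in the expanded space
w_poly = np.array([-0.65**2, 0.0, 0.0, 1.0, 0.0, 1.0])

z_demo = expand_degree2(x_demo) @ w_poly      # linear score in the expanded space
y_hat_demo = np.where(z_demo >= 0, 1, -1)     # same decision rule as the perceptron
print("Accuracy of a linear rule on expanded features:", np.mean(y_hat_demo == y_demo))
```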

In [None]:
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.pipeline import make_pipeline

# Load or generate your data
x, y = make_regular_data()
y = np.where(y == 0, -1, 1)  # Convert labels to -1 and 1

X_train, X_val, y_train, y_val = train_test_split(x, y, 
                                                  test_size=0.3,    # 30% validation
                                                  stratify =y,      # Stratified split
                                                  shuffle  =True,   # Shuffle the data
                                                  random_state=42)

# Create pipeline: standardize, polynomial transform, and then perceptron
poly_perc_model = make_pipeline( StandardScaler(),  # Standardize features
                        PolynomialFeatures(degree=2),   # You can change degree to 3, 4, etc.
                        Perceptron(max_iter=100, eta0=1.0, random_state=42) )

# Train
poly_perc_model.fit(X_train, y_train)

# Get the trained perceptron model (last step in the pipeline)
perc = poly_perc_model[-1]

# Print coefficients and intercept
print("Coefficients:", perc.coef_)
print("Intercept:   ", perc.intercept_)

# Plot decision boundary
plot_boundary_decision(X_train, y_train, poly_perc_model, probabilistic=False)


In [None]:
# print accuracy for the training and validation datasets
print("Train accuracy:", poly_perc_model.score(X_train, y_train))
print("Validation accuracy:", poly_perc_model.score(X_val, y_val))

***
### 💡 Reflect and Run

- Carefully study the code above and make sure you understand how the pipeline works. Then, modify the polynomial degree from 2 to 3, 5, 7, and higher. Observe how the decision boundary changes as the degree increases. Does the model start to overfit?

- For each polynomial degree, plot the confusion matrix and compare the results. What patterns or trends do you notice across degrees? Are there any similarities in classification performance? Reflect on the trade-off between model complexity and generalization.

- Now you can apply the pipeline to the non-linear datasets we generated earlier: try `make_concentric_circles` and `make_two_half_moons` and test it for low- and high-order polynomials.

- Now for something even cooler! Simply replace the `Perceptron` line in the pipeline with `LogisticRegression()` and re-run the code for different datasets and different polynomial degrees, with and without L1 and L2 regularization, and find the best model. Do not forget to set `probabilistic=True` when you call the `plot_boundary_decision` function. What you are seeing is the beauty of scikit-learn and its powerful pipelines.
***
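
The degree sweep suggested above can be sketched as a short loop. This is only a minimal sketch under stated assumptions: scikit-learn's `make_moons` stands in for the notebook's own data generators so that the cell runs on its own, `LogisticRegression` is swapped in for the `Perceptron` as the last bullet suggests, and the names `X_demo`, `X_tr`, `X_va`, and `sweep_results` are hypothetical.

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Stand-in data (assumption): make_moons replaces the notebook's generators.
X_demo, y_demo = make_moons(n_samples=400, noise=0.25, random_state=42)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42)

# Fit one pipeline per polynomial degree and record both accuracies.
sweep_results = {}
for degree in (1, 2, 3, 5, 7):
    model = make_pipeline(
        StandardScaler(),                    # Standardize features
        PolynomialFeatures(degree=degree),   # Polynomial feature expansion
        # Default L2 regularization; a smaller C means a stronger penalty.
        LogisticRegression(C=1.0, max_iter=5000))
    model.fit(X_tr, y_tr)
    sweep_results[degree] = (model.score(X_tr, y_tr), model.score(X_va, y_va))

for degree, (acc_tr, acc_va) in sweep_results.items():
    print(f"degree={degree}: train={acc_tr:.3f}, validation={acc_va:.3f}")
```

A training accuracy that keeps rising while the validation accuracy stalls or drops is the usual sign of overfitting. Changing `C` lets you see how regularization affects that gap.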

## Multiclass Classifiers

Up to this point, our focus has been on **binary classification**, that is, distinguishing between two classes. But what if we have more than two? How can we extend our approach to handle multi-class classification?

The code below demonstrates the **One-vs-Rest (OvR)** strategy using the perceptron. We have data with three classes (0, 1, 2); three separate binary models are trained, each separating one class from the other two, and during prediction, the class with the highest confidence score is selected.
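
To make the mechanics of OvR explicit, here is a hand-written sketch of the idea: one binary perceptron per class, then an argmax over their scores. It is an illustration under stated assumptions, not the notebook's implementation: `make_blobs` with illustrative centers stands in for `make_three_classes`, and the names `X_blobs`, `binary_models`, `scores`, and `y_ovr` are hypothetical. To my understanding, scikit-learn's linear classifiers apply an equivalent scheme internally when they receive more than two labels.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import Perceptron
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): three Gaussian blobs with illustrative centers.
X_blobs, y_blobs = make_blobs(n_samples=300,
                              centers=[[0, 5], [2, 0], [5, 4]],
                              cluster_std=1.0, random_state=42)
X_std = StandardScaler().fit_transform(X_blobs)

# Train one binary "class c vs. the rest" perceptron per class.
classes = np.unique(y_blobs)
binary_models = []
for c in classes:
    y_binary = np.where(y_blobs == c, 1, -1)   # +1 for class c, -1 otherwise
    clf = Perceptron(max_iter=100, eta0=0.5, random_state=42)
    clf.fit(X_std, y_binary)
    binary_models.append(clf)

# scores[i, k] is the signed score of sample i under the "class k vs. rest" model.
scores = np.column_stack([m.decision_function(X_std) for m in binary_models])

# Predict the class whose binary model is most confident.
y_ovr = classes[np.argmax(scores, axis=1)]
print("Manual OvR training accuracy:", np.mean(y_ovr == y_blobs))
```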

#### ⚠️ Important Note
Before moving on, take a look at the training data. It's important to emphasize that you should not plot, analyze, or use the validation/test data during training or visualization. Keep the test set **completely untouched** until the final evaluation. This ensures a fair and unbiased assessment of your model's selection and performance.
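
If you also want a held-out test set in addition to the validation set, one common way is to call `train_test_split` twice. The sketch below assumes stand-in data from `make_blobs` and hypothetical variable names; it is not part of the lab's own workflow, which only uses a training/validation split.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): 300 samples in three balanced classes.
X_all, y_all = make_blobs(n_samples=300, centers=3, random_state=0)

# Step 1: set aside the test set first and do not look at it again.
# An integer test_size is interpreted as an absolute number of samples.
X_work, X_test, y_work, y_test = train_test_split(
    X_all, y_all, test_size=60, stratify=y_all, random_state=42)

# Step 2: split the remaining samples into training and validation sets.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_work, y_work, test_size=60, stratify=y_work, random_state=42)

print(len(X_tr), len(X_va), len(X_test))  # 180 60 60
```

Model selection (polynomial degree, regularization, and so on) then uses only the training and validation sets, and the test set is scored once at the very end.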

In [None]:
x, y = make_three_classes( noise=1.0)

# Since the data is sparse, I make a larger validation set (30%)
X_train, X_val, y_train, y_val = train_test_split(x, y, 
                                                  test_size=0.3,    # 30% validation
                                                  stratify =y,      # Stratified split
                                                  shuffle  =True,   # Shuffle the data
                                                  random_state=42)

visualize_data(X_train, y_train, "Three-Class Data (Training Set)")

Here, I use scikit-learn's `Perceptron`, which handles multi-class data with OvR. After fitting the model to the training data, the code predicts the validation labels and evaluates the model. A confusion matrix further breaks down performance across the three classes. This approach shows how OvR allows binary models to be adapted for multi-class problems in an effective way.

In [None]:
# Create pipeline: standardize, then train Perceptron
ovr_model = make_pipeline(
    StandardScaler(),
    Perceptron(max_iter=100, eta0=0.5, random_state=42) )

# Fit model
ovr_model.fit(X_train, y_train)

# Predict
y_pred = ovr_model.predict(X_val)

# Plot decision boundary
plot_boundary_decision(X_train, y_train, ovr_model, probabilistic=False)

In [None]:
# Make predictions
y_pred = ovr_model.predict(X_val)

# Compute confusion matrix
confusion = confusion_matrix(y_val, y_pred, labels=[0, 1, 2])
labels = ["Class 0", "Class 1", "Class 2"]

# Plot
plt.figure(figsize=(4, 4))
# Counts are integers, so format them with "d"; the colorbar is disabled.
sns.heatmap(confusion, annot=True, fmt="d", cmap="gist_heat_r",
            xticklabels=labels, yticklabels=labels,
            cbar=False, linewidths=1, linecolor="k")

plt.title("Confusion Matrix")
plt.xlabel("Predicted label")
plt.ylabel("True label")

plt.xticks(np.arange(3)+0.5, [0, 1, 2])  # Center tick labels
plt.yticks(np.arange(3)+0.5, [0, 1, 2], rotation=0)

plt.show()

***
### ✅ Check your understanding

- Let me challenge you with a question: Compare the results from the confusion matrix with the data classified by the decision boundary in your plot. Do they qualitatively match and **why**? For example, does the number of visibly misclassified points in the plot correspond to the misclassifications reported in the confusion matrix? Think carefully about this before moving on. Don't proceed without a clear explanation. I'll give the answer at the end of this session.

### 💡 Reflect and Run

- Instead of the Perceptron, use `LogisticRegression` and test your model.

- Modify the data by changing the `noise` parameter in `make_three_classes` to `noise=2.0`. Now rerun all the OvR-related code cells. Analyze the confusion matrix and explain your observations. Use `accuracy_score` to measure the overall validation accuracy.

- What happens if you add a fourth class to the data? For example, modify the `make_three_classes` function to accommodate four classes with the following centers (you can also write a new function named `make_four_classes`):
```python
[[0, 5], [2, 0], [5, 4], [2, 6]]
```
and now run your model again. Keep the noise level high, but you can modify it and set it to any value you like.

- Why might certain classes be more difficult to distinguish from others?

- To answer my question above, make a simple change when plotting the confusion matrix: replace `X_val` with `X_train` in `ovr_model.predict(...)`, and replace `y_val` with `y_train` in `confusion_matrix(...)`. Now rerun the confusion matrix and compare it to what you saw in the decision boundary plot. Do you notice the connection? What exactly did I ask you to change, and why does it matter?

⚠️ **Note:** While this helps you visually connect predictions to what you see in the training plot, remember that the **correct setup for evaluating performance metrics (including confusion matrix)** is to always use the **validation or test data**, not the training set. The training set is for learning, not evaluation.
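
The four-class experiment and the training-versus-validation comparison above can be sketched together as follows. This is only one possible version, under stated assumptions: the `make_four_classes` helper here is hypothetical and built on scikit-learn's `make_blobs`, so it may differ from how `make_three_classes` is written in this lab, and the names `X4`, `model4`, `cm_train`, and `cm_val` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def make_four_classes(n_samples=400, noise=2.0, random_state=42):
    """Hypothetical helper: four Gaussian blobs at the centers listed above."""
    centers = [[0, 5], [2, 0], [5, 4], [2, 6]]
    return make_blobs(n_samples=n_samples, centers=centers,
                      cluster_std=noise, random_state=random_state)

X4, y4 = make_four_classes()
X4_tr, X4_va, y4_tr, y4_va = train_test_split(
    X4, y4, test_size=0.3, stratify=y4, random_state=42)

model4 = make_pipeline(StandardScaler(),
                       Perceptron(max_iter=100, eta0=0.5, random_state=42))
model4.fit(X4_tr, y4_tr)

# Training matrix: matches the points drawn in the decision boundary plot.
cm_train = confusion_matrix(y4_tr, model4.predict(X4_tr), labels=[0, 1, 2, 3])
# Validation matrix: the one to report when evaluating the model.
cm_val = confusion_matrix(y4_va, model4.predict(X4_va), labels=[0, 1, 2, 3])

# Overall accuracy equals the diagonal sum divided by the total count.
acc_val = accuracy_score(y4_va, model4.predict(X4_va))
print("Training confusion matrix:\n", cm_train)
print("Validation confusion matrix:\n", cm_val)
print("Validation accuracy:", acc_val, "=", np.trace(cm_val) / cm_val.sum())
```

Rows are true labels and columns are predicted labels, so large off-diagonal entries show which pairs of classes are confused most often.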

***
END
***