# Module 4: Linear Classifiers & Gradient Descent

**Case Study: Predictive Modeling for Public Water Safety**

**Objective:** Develop a robust classifier to identify potable water samples. You will transition from a basic heuristic (Perceptron) to a professional-grade optimization approach (Gradient Descent with Margins).

# 1. Data Acquisition & Cleaning

In real-world data science, datasets are rarely perfect. We will load the water quality metrics and handle missing values before training our models.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the dataset from a public raw GitHub URL
url = "https://raw.githubusercontent.com/nferran/tp_aprendizaje_de_maquina_I/main/water_potability.csv"
df = pd.read_csv(url)

# Step 1: Handling Missing Values
# Water sensors often fail, leaving NaNs. We will fill them with the mean of the column.
df.fillna(df.mean(), inplace=True)

# Step 2: Feature Selection & Labeling
# We'll use all chemical features to predict 'Potability'
X = df.drop('Potability', axis=1).values
y = df['Potability'].values

# Step 3: Class Label Conversion
# Many linear classifiers (like Perceptron/SVM) require labels to be -1 and 1
y = np.where(y == 0, -1, 1)

# Step 4: Train-Test Split & Scaling
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

print(f"Dataset Loaded: {X_train.shape[0]} training samples, {X_train.shape[1]} features.")

Dataset Loaded: 2620 training samples, 9 features.
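
The cell above computes the fill values with `df.mean()` over the whole dataset, so test rows influence the values used to fill training rows. A leakage-free variant would split first and take the means from the training split only. The following is a minimal sketch of that idea on toy arrays; the helper `impute_with_train_means` is a hypothetical name, not part of the notebook.

```python
import numpy as np

def impute_with_train_means(X_train, X_test):
    """Fill NaNs in both splits using column means from the training split only."""
    # nanmean ignores NaNs when averaging each column
    train_means = np.nanmean(X_train, axis=0)
    X_train_filled = np.where(np.isnan(X_train), train_means, X_train)
    X_test_filled = np.where(np.isnan(X_test), train_means, X_test)
    return X_train_filled, X_test_filled

# Toy data (hypothetical values, not taken from the water dataset)
toy_train = np.array([[1.0, np.nan], [3.0, 4.0], [np.nan, 8.0]])
toy_test = np.array([[np.nan, np.nan]])
filled_train, filled_test = impute_with_train_means(toy_train, toy_test)
print(filled_test.tolist())  # [[2.0, 6.0]]
```

In the notebook, this would be applied to the arrays returned by `train_test_split`, before `StandardScaler`.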


# 2. Phase 1: The Heuristic Approach (Perceptron)

The **Perceptron** is one of the earliest supervised learning algorithms. It has no "global" view of the error: it processes one sample at a time and corrects its weights only when that sample is misclassified.

**Task:** Implement the Perceptron Update Rule inside the training loop.
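
For reference, the rule can be written as follows, where $\eta$ is the learning rate (`lr` in the code) and a sample counts as a mistake when its label and raw score disagree in sign or the score is exactly zero:

$$
\text{if } y_i (w^T x_i + b) \le 0: \qquad w \leftarrow w + \eta\, y_i x_i, \qquad b \leftarrow b + \eta\, y_i
$$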

In [17]:
class WaterPerceptron:
    def __init__(self, lr=0.01, epochs=50):
        self.lr = lr
        self.epochs = epochs
        self.w = None
        self.b = 0
        self.mistakes = []

    def fit(self, X, y):
        self.w = np.zeros(X.shape[1])
        for epoch in range(self.epochs):
            count = 0
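            for i in range(len(y)):
                # Raw (unthresholded) score for sample i
                prediction = np.dot(X[i], self.w) + self.b

                # Update only when the sample is misclassified or lies on the boundary
                if y[i] * prediction <= 0:
                    self.w = self.w + self.lr * y[i] * X[i]
                    self.b = self.b + self.lr * y[i]
                    count += 1

            # Number of updates made during this epoch
            self.mistakes.append(count)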

    def predict(self, X):
        return np.sign(np.dot(X, self.w) + self.b)



In [18]:
# Train Perceptron
perceptron = WaterPerceptron()
perceptron.fit(X_train, y_train)

# Predictions
y_pred = perceptron.predict(X_test)

print("Predictions:", y_pred[:10])
print("Actual:     ", y_test[:10])

Predictions: [-1. -1.  1. -1. -1. -1. -1. -1.  1.  1.]
Actual:      [-1  1 -1 -1  1  1 -1 -1 -1 -1]


# 3. Phase 2: Gradient Descent - Global Optimization

On data that is not linearly separable, the Perceptron never stops making mistakes, so its weights keep jumping from one update to the next. To get a stable solution, we use **Gradient Descent** to minimize a **Mean Squared Error (MSE)** loss function over the entire dataset.

**Task:** Implement the batch gradient calculation for weights and bias.
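
The gradients follow from differentiating the cost with respect to $w$ and $b$. Assuming the $\frac{1}{2n}$ scaling used for `cost` in the cell below, with $X$ the $n \times d$ feature matrix and $z$ the vector of linear outputs:

$$
J(w, b) = \frac{1}{2n} \sum_{i=1}^{n} (z_i - y_i)^2, \qquad z_i = w^T x_i + b
$$

$$
\frac{\partial J}{\partial w} = \frac{1}{n} X^T (z - y), \qquad \frac{\partial J}{\partial b} = \frac{1}{n} \sum_{i=1}^{n} (z_i - y_i)
$$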

In [15]:
class GDWaterClassifier:
    def __init__(self, lr=0.001, epochs=500):
        self.lr = lr
        self.epochs = epochs
        self.w = None
        self.b = 0
        self.cost_history = []

    def fit(self, X, y):
        self.w = np.zeros(X.shape[1])
        n = X.shape[0]

        for _ in range(self.epochs):
            # 1. Linear output
            z = np.dot(X, self.w) + self.b

            # 2. Gradients
            dw = (1 / n) * np.dot(X.T, (z - y))
            db = (1 / n) * np.sum(z - y)

            # 3. Update weights and bias
            self.w = self.w - self.lr * dw
            self.b = self.b - self.lr * db

            # 4. Record the cost for this epoch (computed from z, before the update above)
            cost = (1 / (2 * n)) * np.sum((z - y) ** 2)
            self.cost_history.append(cost)

    def predict(self, X):
        return np.sign(np.dot(X, self.w) + self.b)

In [16]:
# Train the model
gd_model = GDWaterClassifier()
gd_model.fit(X_train, y_train)

# Predict on test data
y_pred = gd_model.predict(X_test)

# Print first 10 predictions
print("Predictions:", y_pred[:10])
print("Actual:     ", y_test[:10])

Predictions: [-1. -1. -1. -1. -1. -1. -1. -1. -1. -1.]
Actual:      [-1  1 -1 -1  1  1 -1 -1 -1 -1]


# 4. Phase 3: Margin Classifiers & Hinge Loss

In water safety, correctness alone is not enough: we also want a **Margin**, a safety gap between safe and unsafe samples. This is achieved by minimizing **Hinge Loss** combined with **L2 Regularization**.

The loss function is defined as:

$$
\text{Loss} = \lambda \|w\|^2_2 + \sum_{i} \max(0, 1 - y_i (w^T x_i + b))
$$

### Key Components:
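- **Hinge Loss**: $\max(0, 1 - y_i (w^T x_i + b))$ is zero only when a sample is on the correct side with $y_i (w^T x_i + b) \ge 1$; otherwise it grows linearly with the size of the violation, so samples inside the margin are penalized even when they are classified correctly.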
- **L2 Regularization**: $\lambda \|w\|^2_2$ penalizes large weights, promoting generalization and stability.
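
To make the hinge term concrete, the sketch below evaluates it for four hypothetical raw scores that all belong to the positive class. The function name `hinge_losses` and the score values are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def hinge_losses(scores, labels):
    """Per-sample hinge loss max(0, 1 - y * score) for labels in {-1, +1}."""
    return np.maximum(0.0, 1.0 - labels * scores)

# Hypothetical raw scores w^T x + b for four samples, all with true label +1
scores = np.array([2.0, 1.0, 0.5, -1.0])
labels = np.array([1, 1, 1, 1])
losses = hinge_losses(scores, labels)
print(losses.tolist())  # [0.0, 0.0, 0.5, 2.0]
```

Only the first two samples incur no loss. The third is classified correctly but sits inside the margin, and the fourth is misclassified.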


In [19]:
class MarginWaterClassifier:
    def __init__(self, lr=0.001, lambda_param=0.01, epochs=500):
        self.lr = lr
        self.lambda_param = lambda_param
        self.epochs = epochs
        self.w = None
        self.b = 0

    def fit(self, X, y):
        self.w = np.zeros(X.shape[1])
        for _ in range(self.epochs):
            for i, x_i in enumerate(X):
                condition = y[i] * (np.dot(x_i, self.w) + self.b) >= 1

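                if condition:
                    # Margin satisfied: only the L2 term contributes to the gradient
                    self.w -= self.lr * (2 * self.lambda_param * self.w)
                else:
                    # Margin violated: the hinge sub-gradient is -y_i * x_i for w and -y_i for b
                    self.w = self.w - self.lr * (2 * self.lambda_param * self.w - x_i * y[i])
                    # The score is w.x + b, so descending -y_i means adding lr * y_i to b
                    self.b = self.b + self.lr * y[i]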

    def predict(self, X):
        return np.sign(np.dot(X, self.w) + self.b)

In [20]:
# Train Margin Classifier
margin_model = MarginWaterClassifier()
margin_model.fit(X_train, y_train)

# Predictions
y_pred_margin = margin_model.predict(X_test)

print("Predictions:", y_pred_margin[:10])
print("Actual:     ", y_test[:10])

Predictions: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
Actual:      [-1  1 -1 -1  1  1 -1 -1 -1 -1]


In [21]:
def accuracy(y_true, y_pred):
    return np.mean(y_true == y_pred)

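# Score each model on its own predictions; reusing a single `y_pred` variable
# would report the same number for two different models.
print("Perceptron Accuracy:", accuracy(y_test, perceptron.predict(X_test)))
print("Gradient Descent Accuracy:", accuracy(y_test, gd_model.predict(X_test)))
print("Margin Classifier Accuracy:", accuracy(y_test, margin_model.predict(X_test)))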

Perceptron Accuracy: 0.5015243902439024
Gradient Descent Accuracy: 0.5015243902439024
Margin Classifier Accuracy: 0.3719512195121951
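
Accuracy figures like these are easier to judge against a baseline that always predicts the most frequent training label. The sketch below shows one way to compute it; `majority_baseline_accuracy` is a hypothetical helper and the toy labels are made up for illustration.

```python
import numpy as np

def majority_baseline_accuracy(y_train, y_test):
    """Accuracy of always predicting the most frequent training label."""
    labels, counts = np.unique(y_train, return_counts=True)
    majority_label = labels[np.argmax(counts)]
    return float(np.mean(y_test == majority_label))

# Toy labels (hypothetical): -1 is the majority class in training
toy_train = np.array([-1, -1, -1, 1, 1])
toy_test = np.array([-1, 1, -1, -1])
baseline = majority_baseline_accuracy(toy_train, toy_test)
print(baseline)  # 0.75
```

In the notebook this would be called as `majority_baseline_accuracy(y_train, y_test)`. A model that scores below this value is doing worse than ignoring the features entirely.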


# 5. Critical Analysis & Comparison

**Analysis Tasks:**
1. Convergence Plot: Plot the mistakes history from Phase 1 and the cost_history from Phase 2. Discuss why the Gradient Descent plot is smoother.
2. Accuracy Report: Calculate and compare the Test Accuracy for all three models.
3. Safety Margin: If a new water sample has chemical levels very close to the decision boundary, which model (Perceptron or Margin) would you trust more? Why?

Answers:

1. The Perceptron mistake history shows irregular fluctuations because it updates weights only when misclassification occurs and has no global objective function. In contrast, the Gradient Descent cost history is smoother because it minimizes the Mean Squared Error over the entire dataset using batch updates, resulting in stable and gradual convergence.

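2. The recorded output does not support a clear ranking. The Perceptron and Gradient Descent lines report the identical value 0.5015, which is what happens when both are scored from one shared `y_pred` array, so that number reflects only whichever model wrote `y_pred` last. The Margin Classifier's 0.372 is the lowest of the three, and all ten of its displayed predictions are +1. A score that low with uniformly positive predictions is the signature of a bias pushed in the wrong direction (for a score $w^T x + b$, the violated-margin step must add `lr * y[i]` to `b`), not evidence that hinge loss generalizes worse. Each model should be scored on its own predictions and compared against a majority-class baseline before drawing conclusions.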

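3. For a sample close to the decision boundary, a correctly trained Margin Classifier is the more trustworthy of the two. The Perceptron stops updating as soon as a point is on the right side of the boundary, however narrowly, whereas hinge loss keeps penalizing any training point with $y_i (w^T x_i + b) < 1$, which pushes the boundary away from the training data. That buffer matters in safety-critical applications like water quality assessment, where a borderline sample should not be labeled potable on a marginal score.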
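
How close a sample is to the boundary can be measured directly from a trained linear model. The sketch below computes the geometric distance $|w^T x + b| / \|w\|$; the function name, weights, samples, and the 0.5 review threshold are all hypothetical values chosen to keep the arithmetic easy to check.

```python
import numpy as np

def distance_to_boundary(X, w, b):
    """Geometric distance |w.x + b| / ||w|| of each row of X from the hyperplane."""
    return np.abs(X @ w + b) / np.linalg.norm(w)

# Hypothetical weights and samples
w = np.array([3.0, 4.0])   # ||w|| = 5
b = -5.0
samples = np.array([[1.0, 1.0], [3.0, 4.0]])
distances = distance_to_boundary(samples, w, b)
print(distances.tolist())  # [0.4, 4.0]

# Flag samples closer than a chosen threshold (0.5 is an arbitrary illustration)
needs_review = distances < 0.5
print(needs_review.tolist())  # [True, False]
```

With the notebook's models, `X` would be the scaled features, so the distances are in standardized units rather than raw chemical concentrations.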

# Discussion Questions

### Q1: Impact of High Learning Rate in Gradient Descent
What happens to your **Gradient Descent** model if you set the learning rate `lr` too high (e.g., `1.0`)?
*Hint: Think about convergence, overshooting, and divergence.*

---

### Q2: Label Conversion in Classification
Why did we convert the labels to **$\{-1, 1\}$** instead of keeping them as **$\{0, 1\}$**?
*Hint: Consider the mathematical formulation of the loss function (e.g., Hinge Loss) and symmetry.*

---

### Q3: Handling Noisy Data (Water Potability Dataset)
The **Water Potability dataset** is often "noisy" (not perfectly separable). Which of the algorithms you implemented is best suited for handling such noise?
*Hint: Think about robustness to outliers and margin-based classifiers.*


Answers:

1. If the learning rate is set too high, Gradient Descent may overshoot the optimal solution, causing oscillations or divergence. Instead of converging smoothly to the minimum loss, the model may fail to converge or produce unstable results.

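2. With labels in $\{-1, 1\}$, the product $y_i (w^T x_i + b)$ is positive exactly when the sample is classified correctly, so one expression serves as the mistake test in the Perceptron and as the margin in hinge loss. The update term $y_i x_i$ also moves $w$ in the right direction for both classes; with $\{0, 1\}$ labels, class-0 samples would contribute nothing to it. Finally, `np.sign` maps the raw score directly onto the label set.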

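3. Among the implemented algorithms, the Margin Classifier is in principle the best suited to noisy data. Hinge loss grows only linearly with the size of a violation and is zero for points beyond the margin, so a few extreme samples pull on the boundary less than they do under squared error, and the L2 term limits the size of the weights. The Perceptron, by contrast, never settles on non-separable data because every pass finds new mistakes to correct.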
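
The overshooting described in answer 1 can be made quantitative for the squared-error cost: batch gradient descent diverges once the step size exceeds $2 / \lambda_{\max}$, where $\lambda_{\max}$ is the largest eigenvalue of the cost's Hessian. The sketch below illustrates this on synthetic data; the data, the helper `final_mse_cost`, and the two step sizes are assumptions made for the demonstration, not results from the water dataset.

```python
import numpy as np

def final_mse_cost(X, y, lr, epochs=50):
    """Run batch gradient descent on the 1/(2n) squared-error cost; return the last cost."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        residual = X @ w + b - y
        w -= lr * (X.T @ residual) / n
        b -= lr * residual.mean()
    residual = X @ w + b - y
    return float(np.sum(residual ** 2) / (2 * n))

# Synthetic data (hypothetical): two strongly correlated, standardized features
rng = np.random.default_rng(0)
base = rng.normal(size=200)
X = np.column_stack([base, base + 0.1 * rng.normal(size=200)])
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = np.where(base > 0, 1.0, -1.0)

# With zero-mean features the bias decouples, so the largest curvature
# is the top eigenvalue of X^T X / n (at least 1 for standardized features)
lambda_max = np.linalg.eigvalsh(X.T @ X / len(y)).max()
stability_limit = 2 / lambda_max

cost_small = final_mse_cost(X, y, lr=0.1)   # below the limit: cost decreases
cost_large = final_mse_cost(X, y, lr=1.5)   # above the limit: cost blows up
print("stability limit:", stability_limit)
print("lr=0.1:", cost_small)
print("lr=1.5:", cost_large)
```

Whether `lr = 1.0` itself diverges therefore depends on how correlated the features are: the more correlated they are, the larger $\lambda_{\max}$ and the smaller the safe step size.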