# From Scratch to Scikit-learn: Regression & Classification (CSC 422)

**Duration:** 2 hours  
**Format:** Live coding with student participation  
**Course:** CSC 422 - Machine and Deep Learning

---

## Learning Goals

By the end of class, students should:
- Recognize the value of using libraries (scikit-learn) vs. coding from scratch
- Implement regression with scikit-learn and compare with their scratch version
- Understand the general pipeline of supervised ML (fit → predict → evaluate)
- Be introduced to core shallow classifiers (kNN, logistic regression, Naïve Bayes)
- Practice applying models to small datasets with scikit-learn

---

## ⏱ Timeline

- **0–15 min** — Bridge from Scratch to Library
- **15–40 min** — Supervised ML Workflow with Scikit-learn
- **40–60 min** — Transition to Classification
- **60–90 min** — Shallow Classification Models with Scikit-learn
- **90–115 min** — Guided Lab Exercise
- **115–120 min** — Wrap-Up & Forward Look

---

## Setup

We'll need both basic scientific computing tools and scikit-learn for today's exploration.

In [None]:
# Essential imports for scientific computing
import numpy as np
import matplotlib.pyplot as plt

In [None]:
# Scikit-learn imports for machine learning
from sklearn.linear_model import _____________, _____________
from sklearn.neighbors import _____________
from sklearn.naive_bayes import _____________
from sklearn.model_selection import _____________
from sklearn.datasets import _____________, _____________

In [None]:
# Evaluation metrics and utilities
from sklearn.metrics import mean_squared_error, accuracy_score, confusion_matrix
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Reproducibility and verification
np.random.seed(42)
print("✅ All libraries imported successfully!")

---

# 0–15 min: Bridge from Scratch to Library

**Goal:** Connect IC_2's mathematical foundations to professional machine learning tools

## Review: What We Built in IC_2

Last class, we implemented gradient descent from scratch. Today, we'll see how scikit-learn uses the same mathematical principles with a much cleaner interface.

In [None]:
# Recreate the same dataset from IC_2
a_true, b_true = 2.5, -1.0
n_points = 100

# Generate identical noisy linear data
x = np.random.uniform(-2, 2, n_points)
y = a_true * x + b_true + np.random.normal(0, 0.5, n_points)

print(f"Dataset: {n_points} points with noise")

In [None]:
# Train/test split (same as IC_2)
split_idx = int(0.8 * n_points)
x_train, y_train = x[:split_idx], y[:split_idx]
x_test, y_test = x[split_idx:], y[split_idx:]

print(f"Split: {len(x_train)} train, {len(x_test)} test")

## The Manual Approach (IC_2 Review)

Let's quickly recreate our gradient descent solution:

In [None]:
# Our scratch gradient descent (simplified version)
def train_scratch(x, y, learning_rate=0.1, steps=100):
    a, b = 0.0, 0.0
    n = len(x)
    
    for _ in range(steps):
        # Compute gradients
        pred = a * x + b
        grad_a = (2/n) * np.sum((pred - y) * x)
        grad_b = (2/n) * np.sum(pred - y)
        
        # Update parameters
        a -= learning_rate * grad_a
        b -= learning_rate * grad_b
    
    return a, b

In [None]:
# Train with our scratch implementation
a_scratch, b_scratch = train_scratch(x_train, y_train)
scratch_mse = np.mean((y_test - (a_scratch * x_test + b_scratch))**2)

print(f"Scratch result: a={a_scratch:.3f}, b={b_scratch:.3f}")
print(f"Test MSE: {scratch_mse:.4f}")
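The two gradient lines in `train_scratch` come from differentiating the mean squared error. As a worked reference, using the same symbols as the code (`a`, `b`, `n`), with $\eta$ standing in for `learning_rate` (notation introduced here, not from IC_2):

$$
L(a, b) = \frac{1}{n} \sum_{i=1}^{n} \left(a x_i + b - y_i\right)^2
$$

$$
\frac{\partial L}{\partial a} = \frac{2}{n} \sum_{i=1}^{n} \left(a x_i + b - y_i\right) x_i,
\qquad
\frac{\partial L}{\partial b} = \frac{2}{n} \sum_{i=1}^{n} \left(a x_i + b - y_i\right)
$$

$$
a \leftarrow a - \eta \, \frac{\partial L}{\partial a},
\qquad
b \leftarrow b - \eta \, \frac{\partial L}{\partial b}
$$

The term $a x_i + b - y_i$ is `pred - y` in the code, so `grad_a` and `grad_b` match these expressions term by term.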

## The Scikit-Learn Way

Watch how ~15 lines of gradient descent become 3 lines of sklearn:

In [None]:
# Reshape data for sklearn (expects 2D arrays)
X_train = x_train.reshape(-1, 1)
X_test = x_test.reshape(-1, 1)

print(f"Reshaped: {x_train.shape} → {X_train.shape}")
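Scikit-learn estimators expect features as a 2D array of shape `(n_samples, n_features)`, even when there is only one feature. The following is a small standalone sketch of what `reshape(-1, 1)` does; the array values and the names `x_flat`, `X_col`, and `X_two` are made up for illustration.

```python
import numpy as np

# Five samples of a single feature, stored as a flat 1D array
x_flat = np.array([0.5, 1.0, 1.5, 2.0, 2.5])

# -1 lets NumPy infer the number of rows; 1 fixes a single feature column
X_col = x_flat.reshape(-1, 1)

print(x_flat.shape)  # (5,)
print(X_col.shape)   # (5, 1)

# Two features per sample would instead be stacked column-wise
X_two = np.column_stack([x_flat, x_flat ** 2])
print(X_two.shape)   # (5, 2)
```

The targets (`y_train`, `y_test`) can stay 1D; only the feature matrix needs the extra dimension.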

In [None]:
# The sklearn magic: fit → predict → evaluate
model = _____________()
model.______(X_train, y_train)
sklearn_mse = mean_squared_error(y_test, model._______(X_test))

print(f"Sklearn result: a={model.coef_[0]:.3f}, b={model.intercept_:.3f}")
print(f"Test MSE: {sklearn_mse:.4f}")

In [None]:
# Compare results
print(f"🎯 COMPARISON:")
print(f"Scratch:  MSE = {scratch_mse:.4f}")
print(f"Sklearn:  MSE = {sklearn_mse:.4f}")
print(f"Difference: {abs(scratch_mse - sklearn_mse):.6f}")
print("✅ Nearly identical results!")
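Why should the two results agree? Both approaches minimize the same mean squared error; gradient descent gets there iteratively, while ordinary least squares can be solved directly. The sketch below illustrates this on freshly generated data (its own seed and sample size, not the notebook's dataset), using NumPy's `np.linalg.lstsq` as a stand-in for a direct solver.

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed, not the notebook's
x_demo = rng.uniform(-2, 2, 80)
y_demo = 2.5 * x_demo - 1.0 + rng.normal(0, 0.5, 80)

# Design matrix with a column of ones so the intercept is fitted too
A = np.column_stack([x_demo, np.ones_like(x_demo)])

# Direct least-squares solution (no learning rate, no iterations)
(a_ls, b_ls), *_ = np.linalg.lstsq(A, y_demo, rcond=None)

# Gradient descent on the same data, mirroring train_scratch above
a_gd, b_gd = 0.0, 0.0
for _ in range(500):
    resid = a_gd * x_demo + b_gd - y_demo
    a_gd -= 0.1 * (2 / len(x_demo)) * np.sum(resid * x_demo)
    b_gd -= 0.1 * (2 / len(x_demo)) * np.sum(resid)

print(f"least squares:    a={a_ls:.4f}, b={b_ls:.4f}")
print(f"gradient descent: a={a_gd:.4f}, b={b_gd:.4f}")
```

With enough steps and a stable learning rate, the two parameter pairs should match to several decimal places, which is why the MSE difference above is so small.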

## Discussion Break (2 minutes)

**Question for students:** *"When would you implement from scratch vs. use a library like scikit-learn?"*

**Think about:**
- Learning and understanding
- Production systems
- Custom requirements
- Time constraints

---

# 15–40 min: Supervised ML Workflow with Scikit-learn

**Goal:** Master the universal pipeline that works for ANY supervised learning problem

## The Universal ML Pipeline

Every supervised ML project follows these 5 steps, regardless of algorithm:

In [None]:
# The Universal ML Pipeline
print("🔄 THE 5-STEP ML PIPELINE:")
print("1️⃣  LOAD → Get your dataset")
print("2️⃣  SPLIT → Separate train/test")
print("3️⃣  FIT → Train the model")
print("4️⃣  PREDICT → Make predictions")
print("5️⃣  EVALUATE → Measure performance")

## Step 1: Load Dataset

Let's apply this pipeline to a real medical dataset - diabetes progression prediction:

In [None]:
# Step 1: Load a real dataset
diabetes = load_diabetes()
X_diabetes = diabetes.data      # Features: age, BMI, blood pressure, etc.
y_diabetes = diabetes.target    # Target: disease progression score

print(f"Dataset shape: {X_diabetes.shape}")
print(f"Features: {len(diabetes.feature_names)}")
print(f"Target range: {y_diabetes.min():.0f} to {y_diabetes.max():.0f}")

In [None]:
# Explore the features
print("📊 Medical features measured:")
for i, feature in enumerate(diabetes.feature_names):
    print(f"   {i+1}. {feature}")

print(f"\n🎯 Task: Predict disease progression from medical measurements")
print("   This is REGRESSION (continuous target)")

## Step 2: Split Train/Test

Sklearn provides an automatic splitting function:

In [None]:
# Step 2: Automatic train/test split
X_train_diab, X_test_diab, y_train_diab, y_test_diab = _____________(
    X_diabetes, y_diabetes, test_size=0.2, random_state=42
)

print(f"Training set: {X_train_diab.shape[0]} patients")
print(f"Test set: {X_test_diab.shape[0]} patients")
print(f"Features: {X_train_diab.shape[1]} per patient")
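Unlike the manual slice used earlier (`x[:split_idx]`), scikit-learn's splitter shuffles the rows before dividing them, and `random_state` makes that shuffle repeatable. Here is a toy sketch of both behaviors; the arrays `X_toy` and `y_toy` are invented for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(20).reshape(10, 2)   # 10 samples, 2 features
y_toy = np.arange(10)                  # label i belongs to row i

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

print(X_tr.shape, X_te.shape)   # (8, 2) (2, 2)
print(y_te)                     # which rows were held out

# Rows and labels stay aligned after shuffling (row i starts with 2*i)
aligned = all(X_te[i, 0] == 2 * y_te[i] for i in range(len(y_te)))
print(aligned)                  # True

# Same random_state -> same split on every run
_, _, _, y_te_again = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)
print(np.array_equal(y_te, y_te_again))   # True
```

Shuffling matters whenever the rows arrive in a meaningful order (sorted by date, by class, and so on); a plain slice would then give a test set that looks nothing like the training set.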

## Steps 3-5: Fit → Predict → Evaluate

The same API pattern we just learned:

In [None]:
# Step 3: Fit the model (same API!)
diabetes_model = _____________()
diabetes_model.______(X_train_diab, y_train_diab)

print(f"✅ Model trained on {len(diabetes.feature_names)} features")
print(f"Intercept: {diabetes_model.intercept_:.1f}")

In [None]:
# Step 4: Make predictions (same API!)
y_train_pred = diabetes_model._______(X_train_diab)
y_test_pred = diabetes_model._______(X_test_diab)

print(f"Predictions made for {len(y_test_pred)} test patients")
print(f"Example: actual={y_test_diab[0]:.0f}, predicted={y_test_pred[0]:.0f}")

In [None]:
# Step 5: Evaluate performance
train_mse = mean_squared_error(y_train_diab, y_train_pred)
test_mse = mean_squared_error(y_test_diab, y_test_pred)
test_r2 = diabetes_model.score(X_test_diab, y_test_diab)

print(f"Train MSE: {train_mse:.1f}")
print(f"Test MSE: {test_mse:.1f}")
print(f"R² score: {test_r2:.3f} (higher = better)")
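For regressors, `.score()` returns R², which compares the model's squared error to that of always predicting the mean: $R^2 = 1 - SS_{res} / SS_{tot}$. The sketch below computes it by hand on four made-up values and checks the result against `sklearn.metrics.r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up targets and predictions, small enough to check by hand
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

ss_res = np.sum((y_true - y_pred) ** 2)          # error left after the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # error of always predicting the mean
r2_manual = 1 - ss_res / ss_tot

print(f"manual R²:  {r2_manual:.4f}")                    # 0.9625
print(f"sklearn R²: {r2_score(y_true, y_pred):.4f}")     # 0.9625
```

An R² of 1 means perfect predictions, 0 means no better than predicting the mean, and negative values are possible when the model does worse than that baseline.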

## Visualize Results

A quick plot to see how well our model performed:

In [None]:
# Visualize model performance
plt.figure(figsize=(10, 4))

# Actual vs Predicted
plt.subplot(1, 2, 1)
plt.scatter(y_test_diab, y_test_pred, alpha=0.6)
plt.plot([y_test_diab.min(), y_test_diab.max()], 
         [y_test_diab.min(), y_test_diab.max()], 'r--')
plt.xlabel('Actual Progression')
plt.ylabel('Predicted Progression')
plt.title(f'Actual vs Predicted (R² = {test_r2:.3f})')

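# Coefficient magnitude as a rough importance measure
# (comparable here because load_diabetes returns features already scaled to a common range)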
plt.subplot(1, 2, 2)
importance = np.abs(diabetes_model.coef_)
plt.barh(diabetes.feature_names, importance)
plt.xlabel('Coefficient Magnitude')
plt.title('Feature Importance')
plt.tight_layout()
plt.show()

## Discussion Break (3 minutes)

**Question for students:** *"To switch from LinearRegression to a RandomForest, what would you need to change in our code?"*

**Think about:**
- Which lines stay the same?
- Which lines change?
- Why is this API design powerful?

---

# 40–60 min: Transition to Classification

**Goal:** Understand when to predict categories vs. continuous values

## Regression vs. Classification: The Key Difference

In [None]:
# The fundamental difference
print("🔍 REGRESSION vs CLASSIFICATION:")
print()
print("📈 REGRESSION:")
print("   • Predict NUMBERS: 150.5, 23.7, 89.2")
print("   • Examples: house prices, temperature, disease progression")
print("   • Goal: Find best-fit line/curve")

print("\n🏷️  CLASSIFICATION:")
print("   • Predict CATEGORIES: 'spam', 'cat', 'malignant'")
print("   • Examples: email type, animal species, cancer diagnosis")
print("   • Goal: Find decision boundaries")

## Visual Comparison: Lines vs. Boundaries

Let's see the difference in action:

In [None]:
# Create example data for comparison
np.random.seed(42)

# Regression example
x_reg = np.random.uniform(-2, 2, 50)
y_reg = 1.5 * x_reg + 0.5 + np.random.normal(0, 0.3, 50)

# Classification example
x1_class = np.random.uniform(-2, 2, 60)
x2_class = np.random.uniform(-2, 2, 60)
class_labels = (x1_class + x2_class > 0).astype(int)

print("✅ Example datasets created")

In [None]:
# Visualize the difference
plt.figure(figsize=(12, 4))

# Left: Regression
plt.subplot(1, 2, 1)
plt.scatter(x_reg, y_reg, alpha=0.6)
# This is the line used to generate the data, i.e. what a good fit should recover
plt.plot(x_reg, 1.5 * x_reg + 0.5, 'r-', linewidth=2, label='Underlying line')
plt.xlabel('Input Feature')
plt.ylabel('Continuous Target')
plt.title('REGRESSION\nPredict Numbers')
plt.legend()
plt.grid(True, alpha=0.3)

# Right: Classification
plt.subplot(1, 2, 2)
colors = ['red', 'blue']
for i in range(2):
    mask = class_labels == i
    plt.scatter(x1_class[mask], x2_class[mask], c=colors[i], 
               label=f'Class {i}', alpha=0.6)

# Decision boundary
boundary_x = np.linspace(-2, 2, 100)
boundary_y = -boundary_x
plt.plot(boundary_x, boundary_y, 'k-', linewidth=2, label='Decision boundary')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('CLASSIFICATION\nPredict Categories')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Three Key Classification Algorithms

Today we'll explore three fundamental approaches:

In [None]:
# Three classification algorithms overview
print("THREE CLASSIFICATION APPROACHES:")
print()
print("k-NEAREST NEIGHBORS (k-NN)")
print("'You are who your neighbors are'")
print("Look at k closest points, vote on class")

print("\nLOGISTIC REGRESSION")
print("'Find the best linear separator'")
print("Draw straight line between classes")

print("\nNAÏVE BAYES")
print("'Use probability with an independence assumption'")
print("Apply statistics to classify")

## Discussion Break (2 minutes)

**Question for students:** *"For spam email detection, which algorithm would you try first and why?"*

**Consider:**
- Email features (words, sender, etc.)
- Need for interpretability
- Speed requirements
- Data size

---

# 60–90 min: Shallow Classification Models with Scikit-learn

**Goal:** Apply the same pipeline to classification and compare three algorithms

## Dataset: The Classic Iris Dataset

We'll use the famous iris flowers dataset - perfect for learning classification:

In [None]:
# Step 1: Load iris dataset
iris = load_iris()
X_iris = iris.data        # 4 features: sepal/petal length/width
y_iris = iris.target      # 3 species: setosa, versicolor, virginica

print(f"Dataset: {X_iris.shape[0]} flowers, {X_iris.shape[1]} measurements")
print(f"Species: {iris.target_names}")
print(f"Features: {iris.feature_names}")

In [None]:
# Step 2: Train/test split (same function!)
X_train_iris, X_test_iris, y_train_iris, y_test_iris = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=42, stratify=y_iris
)

print(f"Training: {X_train_iris.shape[0]} flowers")
print(f"Testing: {X_test_iris.shape[0]} flowers")
print("✅ Balanced split maintained with stratify")
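To see what `stratify` buys us, the sketch below splits iris twice, once without and once with stratification, and counts how many flowers of each species land in the test set. The variable names (`X_demo`, `y_te_plain`, `y_te_strat`) are chosen for this illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_iris(return_X_y=True)

# Without stratify: class counts in the test set depend on the shuffle
_, _, _, y_te_plain = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# With stratify: each class keeps its share of the data in both splits
_, _, _, y_te_strat = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print("plain:     ", np.bincount(y_te_plain))
print("stratified:", np.bincount(y_te_strat))   # 15 flowers per species
```

On a balanced dataset like iris the effect is modest, but with rare classes an unstratified split can leave a class nearly absent from the test set.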

In [None]:
# Let's explore our classification dataset
print("Iris dataset shape:", X_iris.shape, y_iris.shape)
print("Feature names:", iris.feature_names)
print("Target classes:", iris.target_names)

In [None]:
# Visualize the iris dataset (2D projection)
plt.figure(figsize=(10, 4))

plt.subplot(1, 2, 1)
plt.scatter(X_iris[:, 0], X_iris[:, 1], c=y_iris, cmap='viridis')
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Sepal Width (cm)')
plt.title('Iris Dataset: Sepal Dimensions')
plt.colorbar()

# Right panel: petal dimensions (kept in the same cell so both subplots share one figure)
plt.subplot(1, 2, 2)
plt.scatter(X_iris[:, 2], X_iris[:, 3], c=y_iris, cmap='viridis')
plt.xlabel('Petal Length (cm)')
plt.ylabel('Petal Width (cm)')
plt.title('Iris Dataset: Petal Dimensions')
plt.colorbar()
plt.tight_layout()
plt.show()

Notice how the points group by color (species), most clearly in the petal plot: setosa is well separated, while versicolor and virginica overlap slightly. This structure is what makes classification possible: we can learn decision boundaries that separate the groups.

### Training Our First Classification Model: k-Nearest Neighbors

Let's start with k-NN, which makes predictions based on the k closest training examples.

In [None]:
# Split the iris data (same stratified split as above, repeated so this cell runs on its own)
X_train_iris, X_test_iris, y_train_iris, y_test_iris = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=42, stratify=y_iris
)

In [None]:
# Train k-NN classifier (k=3)
knn = _____________(n_neighbors=3)
knn.______(X_train_iris, y_train_iris)

In [None]:
# Make predictions and evaluate
y_pred_knn = knn._______(X_test_iris)
accuracy_knn = knn.______(X_test_iris, y_test_iris)
print(f"k-NN Accuracy: {accuracy_knn:.3f}")
print(f"Predictions: {y_pred_knn[:10]}")
print(f"Actual:      {y_test_iris[:10]}")
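In the spirit of IC_2, here is a minimal from-scratch sketch of the k-NN voting rule in NumPy. It assumes Euclidean distance and a simple majority vote; the helper `knn_predict` and the toy data are hypothetical and are not how scikit-learn implements the classifier internally.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify one point by majority vote among its k nearest neighbors."""
    # Euclidean distance from the query to every training point
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    # Count votes per class label and return the most common one
    votes = np.bincount(y_train[nearest])
    return int(np.argmax(votes))

# Tiny hypothetical dataset: two clusters in 2D
X_toy = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                  [2.0, 2.0], [2.1, 1.8], [1.9, 2.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([0.1, 0.1])))   # 0
print(knn_predict(X_toy, y_toy, np.array([2.0, 1.9])))   # 1
```

Note that there is no training step beyond storing the data: all the work happens at prediction time, which is why k-NN can be slow on large datasets.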

### Logistic Regression for Classification

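Despite its name, logistic regression is a classification algorithm! It computes a linear score from the features and passes it through a sigmoid (softmax for more than two classes) to get class probabilities, so the decision boundary between classes is still a straight line or plane.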

In [None]:
# Train logistic regression
log_reg = _____________(random_state=42)
log_reg.______(X_train_iris, y_train_iris)

In [None]:
# Evaluate logistic regression
y_pred_log = log_reg._______(X_test_iris)
accuracy_log = log_reg.______(X_test_iris, y_test_iris)
print(f"Logistic Regression Accuracy: {accuracy_log:.3f}")
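To make the "linear score, then probability" idea concrete, here is a small sketch of the binary case. The weights `w = [1, 1]` and bias `b = 0` are hypothetical values picked to match the `x1 + x2 = 0` boundary from the synthetic example earlier; they are not fitted parameters.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights for a two-feature, two-class problem
w = np.array([1.0, 1.0])
b = 0.0

points = np.array([[1.0, 1.0],     # well inside class 1
                   [0.5, -0.5],    # exactly on the boundary w·x + b = 0
                   [-1.0, -2.0]])  # well inside class 0

z = points @ w + b          # linear score, same form as linear regression
p = sigmoid(z)              # squashed into a probability of class 1
labels = (p >= 0.5).astype(int)

for zi, pi, li in zip(z, p, labels):
    print(f"score={zi:+.1f}  P(class 1)={pi:.3f}  predicted={li}")
```

A probability of exactly 0.5 corresponds to a score of 0, so thresholding the probability at 0.5 is the same as asking which side of the line a point falls on.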

### Naïve Bayes Classification

Naïve Bayes applies Bayes' theorem while assuming the features are independent of each other given the class (an assumption that is often "naïve" but works well in practice).

In [None]:
# Train Naïve Bayes
nb = _____________()
nb.______(X_train_iris, y_train_iris)

In [None]:
# Evaluate Naïve Bayes
y_pred_nb = nb._______(X_test_iris)
accuracy_nb = nb.______(X_test_iris, y_test_iris)
print(f"Naïve Bayes Accuracy: {accuracy_nb:.3f}")
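The independence assumption is what keeps the math simple: per-feature likelihoods multiply, so their logs add. Below is a from-scratch sketch assuming the Gaussian variant (each feature modeled as a normal distribution per class), which suits continuous measurements. The helper `gaussian_nb_predict` and the toy data are hypothetical; scikit-learn's implementation also smooths the variances slightly, which this sketch omits.

```python
import numpy as np

def gaussian_nb_predict(X_train, y_train, x_query):
    """Pick the class with the highest log prior + summed log likelihoods."""
    scores = []
    classes = np.unique(y_train)
    for c in classes:
        Xc = X_train[y_train == c]
        mean, var = Xc.mean(axis=0), Xc.var(axis=0)
        log_prior = np.log(len(Xc) / len(X_train))
        # Independence assumption: per-feature log densities simply add up
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var)
                                + (x_query - mean) ** 2 / var)
        scores.append(log_prior + log_lik)
    return int(classes[np.argmax(scores)])

# Tiny hypothetical dataset: two features, two classes
X_toy = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2],
                  [3.0, 0.5], [3.2, 0.7], [2.8, 0.3]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(gaussian_nb_predict(X_toy, y_toy, np.array([1.1, 2.1])))   # 0
print(gaussian_nb_predict(X_toy, y_toy, np.array([3.1, 0.4])))   # 1
```

"Training" here is just computing a mean and variance per feature per class, which is why Naïve Bayes is so fast to fit.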

### Comparing Our Models

Let's see how all three algorithms performed:

In [None]:
# Compare all three models
models = ['k-NN', 'Logistic Regression', 'Naïve Bayes']
accuracies = [accuracy_knn, accuracy_log, accuracy_nb]

plt.figure(figsize=(8, 5))
plt.bar(models, accuracies, color=['blue', 'green', 'orange'])
plt.ylabel('Accuracy')
plt.title('Classification Model Comparison')
plt.ylim(0, 1.1)  # headroom so the labels above near-perfect bars are not clipped
for i, acc in enumerate(accuracies):
    plt.text(i, acc + 0.01, f'{acc:.3f}', ha='center')
plt.show()
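Accuracy is a single number, so it cannot tell us which species get confused with each other. The `confusion_matrix` function we imported in the setup fills that gap. The sketch below uses six made-up labels (not real model output) so the result is easy to verify by hand.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels for six test flowers (not real model output)
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 2, 2, 2])

print(f"accuracy: {accuracy_score(y_true, y_pred):.3f}")   # 0.833

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[2 0 0]
#  [0 1 1]
#  [0 0 2]]
```

The single off-diagonal entry shows one class-1 flower predicted as class 2. The same call works on real predictions, for example `confusion_matrix(y_test_iris, y_pred_knn)`.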

---

# 90–115 min: Guided Lab Exercise

**🔬 Your Turn: Practice the Complete Workflow**

Now it's time to practice what you've learned! Work through this exercise step by step.

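### Exercise: Wine Cultivar Classification

Let's work with scikit-learn's built-in wine dataset, which contains chemical measurements of wines grown from three different cultivars. Your task is to predict each wine's cultivar (a classification problem) using the same workflow we've practiced.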

In [None]:
# Load wine dataset
from sklearn.datasets import load_wine
wine = load_wine()
X_wine, y_wine = wine.data, wine.target

print("Wine dataset loaded!")
print(f"Features: {wine.feature_names[:5]}...") # Show first 5
print(f"Classes: {wine.target_names}")
print(f"Shape: {X_wine.shape}")

**Step 1:** Split the wine data into training and testing sets (use 70% for training)

In [None]:
# Step 1: YOUR CODE HERE
# Split the wine dataset using train_test_split
X_train_wine, X_test_wine, y_train_wine, y_test_wine = _____________(
    _____________, _____________, test_size=_____________, random_state=42
)
print(f"Training set: {X_train_wine.shape}")
print(f"Test set: {X_test_wine.shape}")

**Step 2:** Train all three classification models on the wine data

In [None]:
# Step 2: YOUR CODE HERE
# Create and train all three models
wine_knn = _____________(n_neighbors=5)
wine_log = _____________(random_state=42, max_iter=1000)
wine_nb = _____________()

# Train all models
wine_knn.______(_____________, _____________)
wine_log.______(_____________, _____________)
wine_nb.______(_____________, _____________)
print("All models trained!")

**Step 3:** Evaluate and compare the models' performance

In [None]:
# Step 3: YOUR CODE HERE
# Evaluate all models and compare
wine_accuracies = [
    wine_knn.______(_____________, _____________),
    wine_log.______(_____________, _____________),
    wine_nb.______(_____________, _____________)
]

for model, acc in zip(['k-NN', 'Logistic Regression', 'Naïve Bayes'], wine_accuracies):
    print(f"{model}: {acc:.3f}")

---

# 115–120 min: Wrap-Up & Forward Look

**What We've Accomplished Today**

In this session, you've successfully bridged from mathematical foundations to practical machine learning tools.

### Key Concepts Mastered:

1. **Universal ML Workflow**: Load → Split → Fit → Predict → Evaluate
2. **Scikit-learn Consistency**: Same API across all algorithms
3. **Classification vs Regression**: Discrete categories vs continuous values
4. **Three Classification Algorithms**: k-NN, Logistic Regression, Naïve Bayes
5. **Model Comparison**: How to evaluate and compare different approaches
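The workflow and the shared API can be put together in one compact sketch. Because every estimator exposes the same `fit` and `score` methods, comparing models reduces to a loop. The hyperparameters below are illustrative choices, and the variable names are local to this example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Load and split
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Hyperparameters here are illustrative choices
candidates = {
    "k-NN": KNeighborsClassifier(n_neighbors=3),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
}

# Fit and evaluate: the loop body is identical for every model
results = {}
for name, clf in candidates.items():
    clf.fit(X_tr, y_tr)
    results[name] = clf.score(X_te, y_te)   # accuracy on held-out data
    print(f"{name}: {results[name]:.3f}")
```

Adding another algorithm means adding one entry to the dictionary; nothing else in the loop changes.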

### Next Steps:

- **Practice**: Try these techniques on your own datasets
- **Explore**: Experiment with different parameter values (k in k-NN, etc.)
- **Learn More**: Look into other classification algorithms (Decision Trees, Random Forest, SVM)
- **Real Applications**: Consider how classification applies to your field of interest

**Great work today! You've taken a major step from theory to practice in machine learning! 🚀**