# Lab 1: Introduction to Machine Learning

Welcome to your first machine learning lab! In this lab, we'll explore the fundamentals of machine learning, starting with one of the simplest and most widely used algorithms: linear regression.

## Learning Objectives

By the end of this lab, you will:
- Understand the basic types of machine learning
- Implement linear regression from scratch
- Learn about gradient descent optimization
- Explore polynomial regression and feature engineering
- Use scikit-learn for practical ML tasks
- Build a housing price prediction model

## What is Machine Learning?

Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed. Instead of writing rules, we provide data and let algorithms discover patterns.

### Types of Machine Learning

1. **Supervised Learning**: Learning from labeled examples
   - Regression: Predicting continuous values
   - Classification: Predicting discrete categories

2. **Unsupervised Learning**: Finding patterns in unlabeled data
   - Clustering: Grouping similar examples
   - Dimensionality reduction: Simplifying data

3. **Reinforcement Learning**: Learning through interaction
   - Agent learns by trial and error
   - Maximizes cumulative reward

This week focuses on **supervised learning**.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Tuple, List
import pandas as pd
from sklearn.datasets import make_regression, fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression as SKLinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

## Part 1: Linear Regression from Scratch

Linear regression finds the best-fitting straight line through data points. The model is:

$$\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n$$

Or in vector notation, with the intercept $w_0$ written as $b$: $\hat{y} = \mathbf{w}^T \mathbf{x} + b$

Where:
- $\hat{y}$ is the predicted output
- $\mathbf{x}$ are the input features
- $\mathbf{w}$ are the weights (parameters)
- $b$ is the bias (intercept)
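
To make the vector notation concrete, here is a small sketch with made-up numbers (the weights, bias, and feature values are illustrative assumptions, not fitted values). It shows that stacking examples as rows of a matrix lets a single matrix-vector product compute every prediction at once.

```python
import numpy as np

# Hypothetical parameters for a model with two features
w = np.array([2.0, -1.0])   # weights
b = 0.5                     # bias (intercept)

# Three examples, one per row
X = np.array([
    [1.0, 2.0],
    [0.0, 3.0],
    [4.0, 1.0],
])

# Prediction for a single example: w^T x + b
single = w @ X[0] + b

# Predictions for all rows at once: X w + b
y_hat = X @ w + b

print(single)  # 2*1 - 1*2 + 0.5 = 0.5
print(y_hat)   # values: 0.5, -2.5, 7.5
```

The row-wise form `X @ w + b` is the same expression the `predict` method in the next cell uses.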

### Loss Function

We measure prediction error using Mean Squared Error (MSE):

$$MSE = \frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2$$

Where $m$ is the number of training examples.
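
Gradient descent needs the derivative of this loss with respect to each parameter. Substituting $\hat{y}_i = \mathbf{w}^T \mathbf{x}_i + b$ and applying the chain rule gives:

$$\frac{\partial MSE}{\partial \mathbf{w}} = -\frac{2}{m} \mathbf{X}^T (\mathbf{y} - \hat{\mathbf{y}}), \qquad \frac{\partial MSE}{\partial b} = -\frac{2}{m} \sum_{i=1}^{m} (y_i - \hat{y}_i)$$

The sketch below computes the MSE on a tiny made-up dataset and checks these gradients against a finite-difference approximation. The helper names `mse` and `analytic_gradients` are illustrative and not part of the lab's classes.

```python
import numpy as np

def mse(X, y, w, b):
    """Mean squared error of the linear model X w + b."""
    return np.mean((y - (X @ w + b)) ** 2)

def analytic_gradients(X, y, w, b):
    """Gradients of the MSE with respect to w and b."""
    m = X.shape[0]
    error = y - (X @ w + b)
    dw = -(2 / m) * (X.T @ error)
    db = -(2 / m) * np.sum(error)
    return dw, db

# Small made-up dataset: 4 examples, 2 features
X = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [4.0, 3.0]])
y = np.array([3.0, 2.0, 5.0, 9.0])
w = np.array([0.5, -0.5])
b = 1.0

loss = mse(X, y, w, b)
dw, db = analytic_gradients(X, y, w, b)

# Central finite differences as an independent check
eps = 1e-6
dw_numeric = np.zeros_like(w)
for j in range(w.size):
    step = np.zeros_like(w)
    step[j] = eps
    dw_numeric[j] = (mse(X, y, w + step, b) - mse(X, y, w - step, b)) / (2 * eps)
db_numeric = (mse(X, y, w, b + eps) - mse(X, y, w, b - eps)) / (2 * eps)

print("MSE:", loss)  # 17.875
print("dw analytic:", dw, "numeric:", dw_numeric)
print("db analytic:", db, "numeric:", db_numeric)
```

If the analytic and numeric values agree to several decimal places, the gradient formulas are very likely implemented correctly. This kind of check is a cheap habit worth keeping when writing gradients by hand.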

In [None]:
class LinearRegression:
    """
    Linear Regression using Gradient Descent.
    
    Parameters:
    -----------
    learning_rate : float
        Step size for gradient descent
    n_iterations : int
        Number of training iterations
    """
    
    def __init__(self, learning_rate: float = 0.01, n_iterations: int = 1000):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.weights = None
        self.bias = None
        self.loss_history = []
    
    def fit(self, X: np.ndarray, y: np.ndarray):
        """
        Train the linear regression model.
        
        Parameters:
        -----------
        X : np.ndarray, shape (m, n)
            Training features
        y : np.ndarray, shape (m,)
            Target values
        """
        m, n = X.shape
        
        # Initialize parameters (and reset the history so refitting starts fresh)
        self.weights = np.zeros(n)
        self.bias = 0.0
        self.loss_history = []
        
        # Gradient descent
        for i in range(self.n_iterations):
            # Forward pass
            y_pred = self.predict(X)
            
            # Compute loss
            loss = np.mean((y - y_pred) ** 2)
            self.loss_history.append(loss)
            
            # Compute gradients
            dw = -(2/m) * X.T.dot(y - y_pred)
            db = -(2/m) * np.sum(y - y_pred)
            
            # Update parameters
            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db
    
    def predict(self, X: np.ndarray) -> np.ndarray:
        """
        Make predictions.
        
        Parameters:
        -----------
        X : np.ndarray, shape (m, n)
            Input features
            
        Returns:
        --------
        y_pred : np.ndarray, shape (m,)
            Predictions
        """
        return X.dot(self.weights) + self.bias

### Testing Linear Regression

Let's test our implementation on synthetic data.

In [None]:
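# Generate synthetic data once and split it, so train and test share the
# same underlying relationship (a second make_regression call with a
# different seed would draw a different true coefficient).
X_all, y_all = make_regression(n_samples=130, n_features=1, noise=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X_all, y_all, test_size=30, random_state=42
)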

# Train model
model = LinearRegression(learning_rate=0.01, n_iterations=1000)
model.fit(X_train, y_train)

# Make predictions
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

# Visualize results
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Data and fit
axes[0].scatter(X_train, y_train, alpha=0.6, label='Training data')
axes[0].scatter(X_test, y_test, alpha=0.6, color='orange', label='Test data')
axes[0].plot(X_train, y_pred_train, 'r-', label='Fitted line', linewidth=2)
axes[0].set_xlabel('X')
axes[0].set_ylabel('y')
axes[0].set_title('Linear Regression Fit')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Plot 2: Loss history
axes[1].plot(model.loss_history)
axes[1].set_xlabel('Iteration')
axes[1].set_ylabel('MSE Loss')
axes[1].set_title('Training Loss Over Time')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Learned weights: {model.weights}")
print(f"Learned bias: {model.bias:.2f}")
print(f"\nTraining MSE: {mean_squared_error(y_train, y_pred_train):.2f}")
print(f"Test MSE: {mean_squared_error(y_test, y_pred_test):.2f}")
print(f"Test R² Score: {r2_score(y_test, y_pred_test):.3f}")

## Part 2: Closed-Form Solution

Linear regression has a closed-form solution called the **Normal Equation**:

$$\mathbf{w} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$$

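This directly computes the optimal weights without iteration, provided $\mathbf{X}^T \mathbf{X}$ is invertible. To fit the bias as well, prepend a column of ones to $\mathbf{X}$; the first entry of $\mathbf{w}$ is then the intercept.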
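
As a quick sanity check of the formula, the sketch below applies it to a tiny noise-free dataset generated from $y = 1 + 2x$ (made-up numbers). Because the points lie exactly on a line, the solution should recover the intercept and slope up to floating-point error. Using `np.linalg.solve` instead of forming the inverse explicitly is a common numerical choice, not something the formula itself requires.

```python
import numpy as np

# Noise-free data from y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x

# Design matrix with a leading column of ones for the bias
X_b = np.column_stack([np.ones_like(x), x])

# Solve (X^T X) theta = X^T y
theta = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)

bias, slope = theta
print(f"bias = {bias:.3f}, slope = {slope:.3f}")  # bias = 1.000, slope = 2.000
```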

### Gradient Descent vs Normal Equation

| Aspect | Gradient Descent | Normal Equation |
|--------|------------------|-----------------|
| Speed | Iterative; needs many steps and a tuned learning rate | Single computation; fast when the number of features is small |
| Memory | Low | High (must form the $n \times n$ matrix $X^T X$) |
| Features | Scales to many features | Slow with many features (roughly $O(n^3)$ in the number of features) |
| Generalization | Works for many other models | Only for linear regression |

In [None]:
class LinearRegressionNormalEq:
    """
    Linear Regression using Normal Equation (closed-form solution).
    """
    
    def __init__(self):
        self.weights = None
        self.bias = None
    
    def fit(self, X: np.ndarray, y: np.ndarray):
        """
        Train using normal equation.
        """
        # Add bias term (column of ones)
        X_b = np.c_[np.ones((X.shape[0], 1)), X]
        
        # Normal equation: solve (X^T X) theta = X^T y.
        # np.linalg.solve is more numerically stable than forming the inverse.
        theta = np.linalg.solve(X_b.T.dot(X_b), X_b.T.dot(y))
        
        self.bias = theta[0]
        self.weights = theta[1:]
    
    def predict(self, X: np.ndarray) -> np.ndarray:
        """
        Make predictions.
        """
        return X.dot(self.weights) + self.bias

# Compare with gradient descent
model_normal = LinearRegressionNormalEq()
model_normal.fit(X_train, y_train)
y_pred_normal = model_normal.predict(X_test)

print("Normal Equation Results:")
print(f"Weights: {model_normal.weights}")
print(f"Bias: {model_normal.bias:.2f}")
print(f"Test MSE: {mean_squared_error(y_test, y_pred_normal):.2f}")
print(f"Test R² Score: {r2_score(y_test, y_pred_normal):.3f}")

print("\nGradient Descent Results:")
print(f"Weights: {model.weights}")
print(f"Bias: {model.bias:.2f}")
print(f"Test MSE: {mean_squared_error(y_test, y_pred_test):.2f}")
print(f"Test R² Score: {r2_score(y_test, y_pred_test):.3f}")

## Part 3: Polynomial Regression

Sometimes data doesn't follow a straight line. We can fit non-linear patterns by using **polynomial features**.

For example, with $x$, we can create:
- Linear: $y = w_0 + w_1 x$
- Quadratic: $y = w_0 + w_1 x + w_2 x^2$
- Cubic: $y = w_0 + w_1 x + w_2 x^2 + w_3 x^3$

This is still linear regression (linear in the parameters), but with transformed features.
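
To see what "transformed features" means in practice, this sketch builds the cubic feature matrix for three made-up inputs. Each row holds $[x, x^2, x^3]$ for one example, so an ordinary linear model on these columns is a cubic polynomial in $x$. The weights below are hypothetical, and `np.column_stack` is just one convenient way to assemble the columns.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])

# Columns are x, x^2, x^3
X_poly = np.column_stack([x ** d for d in range(1, 4)])
print(X_poly)

# Hypothetical parameters for y = 1 + 0*x - 2*x^2 + 0.5*x^3
w = np.array([0.0, -2.0, 0.5])
b = 1.0

# The prediction is still a plain linear combination of the columns
y = X_poly @ w + b
print(y)  # values: -0.5, -3.0, -3.5
```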

In [None]:
def create_polynomial_features(X: np.ndarray, degree: int) -> np.ndarray:
    """
    Create polynomial features up to the specified degree.
    
    Parameters:
    -----------
    X : np.ndarray, shape (m, 1)
        Input features
    degree : int
        Maximum polynomial degree
        
    Returns:
    --------
    X_poly : np.ndarray, shape (m, degree)
        Polynomial features [x, x^2, x^3, ..., x^degree]
    """
    m = X.shape[0]
    X_poly = np.zeros((m, degree))
    
    for i in range(1, degree + 1):
        X_poly[:, i-1] = (X[:, 0] ** i)
    
    return X_poly

# Generate non-linear data
np.random.seed(42)
X_nonlinear = np.linspace(-3, 3, 100).reshape(-1, 1)
y_nonlinear = 0.5 * X_nonlinear**3 - 2 * X_nonlinear**2 + X_nonlinear + 5 + np.random.randn(100, 1) * 3
y_nonlinear = y_nonlinear.ravel()

# Fit models with different polynomial degrees
degrees = [1, 2, 3, 5, 10]
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.ravel()

for idx, degree in enumerate(degrees):
    # Create polynomial features
    X_poly = create_polynomial_features(X_nonlinear, degree)
    
    # Fit model
    model_poly = LinearRegressionNormalEq()
    model_poly.fit(X_poly, y_nonlinear)
    y_pred_poly = model_poly.predict(X_poly)
    
    # Calculate metrics
    mse = mean_squared_error(y_nonlinear, y_pred_poly)
    r2 = r2_score(y_nonlinear, y_pred_poly)
    
    # Plot
    axes[idx].scatter(X_nonlinear, y_nonlinear, alpha=0.5, s=20)
    axes[idx].plot(X_nonlinear, y_pred_poly, 'r-', linewidth=2)
    axes[idx].set_title(f'Degree {degree}\nMSE: {mse:.2f}, R²: {r2:.3f}')
    axes[idx].set_xlabel('X')
    axes[idx].set_ylabel('y')
    axes[idx].grid(True, alpha=0.3)

# Remove extra subplot
fig.delaxes(axes[-1])

plt.tight_layout()
plt.show()

print("Notice how higher-degree polynomials fit the training data better,")
print("but may overfit (especially degree 10, which wiggles too much).")

## Part 4: Feature Scaling

When features have different scales (e.g., age in years vs. income in dollars), gradient descent can be slow or even diverge, because a single learning rate has to suit every feature. Feature scaling helps:

1. **Standardization (Z-score normalization)**:
   $$x' = \frac{x - \mu}{\sigma}$$
   
2. **Min-Max Normalization**:
   $$x' = \frac{x - x_{min}}{x_{max} - x_{min}}$$
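
Standardization is implemented in the cell that follows; min-max normalization is not, so here is a minimal sketch of it on made-up data. The function name `min_max_scale` is hypothetical, and the guard for constant columns is an added assumption about how one might handle a zero range.

```python
import numpy as np

def min_max_scale(X):
    """Rescale each column of X to the range [0, 1]."""
    x_min = X.min(axis=0)
    x_range = X.max(axis=0) - x_min
    # Guard against constant columns, where the range is zero
    x_range = np.where(x_range == 0, 1.0, x_range)
    return (X - x_min) / x_range

# Two features on very different scales (illustrative values)
X = np.array([
    [20.0,  30000.0],
    [35.0,  60000.0],
    [50.0, 120000.0],
])

X_scaled = min_max_scale(X)
print(X_scaled)
```

As with standardization, the minimum and range would normally be computed on the training set only and then reused to transform test data.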

In [None]:
class FeatureScaler:
    """
    Standardize features by removing mean and scaling to unit variance.
    """
    
    def __init__(self):
        self.mean = None
        self.std = None
    
    def fit(self, X: np.ndarray):
        """
        Compute mean and standard deviation.
        """
        self.mean = np.mean(X, axis=0)
        self.std = np.std(X, axis=0)
    
    def transform(self, X: np.ndarray) -> np.ndarray:
        """
        Standardize features.
        """
        return (X - self.mean) / (self.std + 1e-8)  # Add epsilon to avoid division by zero
    
    def fit_transform(self, X: np.ndarray) -> np.ndarray:
        """
        Fit and transform in one step.
        """
        self.fit(X)
        return self.transform(X)

# Generate data with different scales
np.random.seed(42)
X_unscaled = np.random.randn(100, 2)
X_unscaled[:, 0] = X_unscaled[:, 0] * 1000  # Large scale
X_unscaled[:, 1] = X_unscaled[:, 1] * 0.01  # Small scale
y = 3 * X_unscaled[:, 0] + 5 * X_unscaled[:, 1] + np.random.randn(100) * 10

# Train without scaling.
# Feature 0 has variance around 1e6, so the learning rate must stay below
# roughly 1e-6 or gradient descent diverges.
model_unscaled = LinearRegression(learning_rate=1e-7, n_iterations=1000)
model_unscaled.fit(X_unscaled, y)

# Train with scaling
scaler = FeatureScaler()
X_scaled = scaler.fit_transform(X_unscaled)
model_scaled = LinearRegression(learning_rate=0.01, n_iterations=1000)
model_scaled.fit(X_scaled, y)

# Compare convergence
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(model_unscaled.loss_history)
plt.xlabel('Iteration')
plt.ylabel('MSE Loss')
plt.title('Without Feature Scaling\n(Learning Rate: 1e-7)')
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
plt.plot(model_scaled.loss_history)
plt.xlabel('Iteration')
plt.ylabel('MSE Loss')
plt.title('With Feature Scaling\n(Learning Rate: 0.01)')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Feature scaling lets us use a much larger learning rate without diverging!")

## Part 5: Using Scikit-Learn

While implementing algorithms from scratch is educational, in practice we use well-tested libraries like scikit-learn.

In [None]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

# Generate data
np.random.seed(42)
X = np.linspace(-3, 3, 100).reshape(-1, 1)
y = 0.5 * X**2 + X + 2 + np.random.randn(100, 1) * 0.5
y = y.ravel()

# Create a pipeline
model_pipeline = Pipeline([
    ('poly_features', PolynomialFeatures(degree=2)),
    ('linear_regression', SKLinearRegression())
])

# Fit and predict
model_pipeline.fit(X, y)
y_pred = model_pipeline.predict(X)

# Plot
plt.figure(figsize=(10, 6))
plt.scatter(X, y, alpha=0.6, label='Data')
plt.plot(X, y_pred, 'r-', linewidth=2, label='Polynomial Fit')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Polynomial Regression with Scikit-Learn')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print(f"R² Score: {r2_score(y, y_pred):.3f}")
print(f"MSE: {mean_squared_error(y, y_pred):.3f}")

## Part 6: Real-World Application - Housing Price Prediction

Let's apply what we've learned to predict housing prices using the California Housing dataset.

In [None]:
# Load California housing dataset
housing = fetch_california_housing()
X_housing = housing.data
y_housing = housing.target
feature_names = housing.feature_names

print("Dataset Information:")
print(f"Number of samples: {X_housing.shape[0]}")
print(f"Number of features: {X_housing.shape[1]}")
print(f"\nFeatures: {feature_names}")
print(f"\nTarget: Median house value (in $100,000s)")

# Create DataFrame for easier exploration
df = pd.DataFrame(X_housing, columns=feature_names)
df['MedianHouseValue'] = y_housing

print("\nFirst few rows:")
print(df.head())

print("\nStatistical summary:")
print(df.describe())

In [None]:
# Visualize feature distributions
fig, axes = plt.subplots(3, 3, figsize=(15, 12))
axes = axes.ravel()

for idx, col in enumerate(feature_names):
    axes[idx].hist(df[col], bins=50, edgecolor='black', alpha=0.7)
    axes[idx].set_title(col)
    axes[idx].set_xlabel('Value')
    axes[idx].set_ylabel('Frequency')
    axes[idx].grid(True, alpha=0.3)

# Target distribution
axes[-1].hist(y_housing, bins=50, edgecolor='black', alpha=0.7, color='red')
axes[-1].set_title('MedianHouseValue')
axes[-1].set_xlabel('Value ($100k)')
axes[-1].set_ylabel('Frequency')
axes[-1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Feature correlation with target
correlations = df.corr()['MedianHouseValue'].sort_values(ascending=False)

plt.figure(figsize=(10, 6))
correlations[1:].plot(kind='barh')
plt.xlabel('Correlation with Median House Value')
plt.title('Feature Correlations')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("Correlations with target:")
print(correlations)

In [None]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X_housing, y_housing, test_size=0.2, random_state=42
)

# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train model
model_housing = SKLinearRegression()
model_housing.fit(X_train_scaled, y_train)

# Predictions
y_train_pred = model_housing.predict(X_train_scaled)
y_test_pred = model_housing.predict(X_test_scaled)

# Evaluate
print("Model Performance:")
print(f"\nTraining Set:")
print(f"  MSE: {mean_squared_error(y_train, y_train_pred):.3f}")
print(f"  RMSE: {np.sqrt(mean_squared_error(y_train, y_train_pred)):.3f}")
print(f"  R² Score: {r2_score(y_train, y_train_pred):.3f}")

print(f"\nTest Set:")
print(f"  MSE: {mean_squared_error(y_test, y_test_pred):.3f}")
print(f"  RMSE: {np.sqrt(mean_squared_error(y_test, y_test_pred)):.3f}")
print(f"  R² Score: {r2_score(y_test, y_test_pred):.3f}")

print("\nNote: RMSE is in units of $100k, so an RMSE of 0.7 means a typical error of roughly $70,000")

In [None]:
# Visualize predictions
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Scatter plot: Actual vs Predicted
axes[0].scatter(y_test, y_test_pred, alpha=0.5)
axes[0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
axes[0].set_xlabel('Actual Price ($100k)')
axes[0].set_ylabel('Predicted Price ($100k)')
axes[0].set_title('Actual vs Predicted Prices')
axes[0].grid(True, alpha=0.3)

# Residual plot
residuals = y_test - y_test_pred
axes[1].scatter(y_test_pred, residuals, alpha=0.5)
axes[1].axhline(y=0, color='r', linestyle='--', lw=2)
axes[1].set_xlabel('Predicted Price ($100k)')
axes[1].set_ylabel('Residuals')
axes[1].set_title('Residual Plot')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Feature importance (coefficients).
# Magnitudes are comparable only because the features were standardized.
feature_importance = pd.DataFrame({
    'feature': feature_names,
    'coefficient': model_housing.coef_
}).sort_values('coefficient', key=abs, ascending=False)

plt.figure(figsize=(10, 6))
plt.barh(feature_importance['feature'], feature_importance['coefficient'])
plt.xlabel('Coefficient Value')
plt.title('Feature Importance (Linear Regression Coefficients)')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("Feature Importance:")
print(feature_importance)

## Key Takeaways

1. **Linear regression** models linear relationships between features and targets
2. **Gradient descent** iteratively updates parameters to minimize loss
3. **Normal equation** provides a direct solution but becomes computationally expensive when the number of features is large
4. **Polynomial features** allow linear models to fit non-linear patterns
5. **Feature scaling** is crucial for gradient descent convergence
6. **Evaluation metrics** (MSE, RMSE, R²) help assess model performance
7. Real-world datasets require **preprocessing** and **feature engineering**
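
Of the metrics listed above, R² is used throughout the lab but never written out. Under the usual definition, $R^2 = 1 - SS_{res} / SS_{tot}$, where $SS_{res}$ is the sum of squared residuals and $SS_{tot}$ is the sum of squared deviations of the targets from their mean. The sketch below computes MSE, RMSE, and R² by hand on made-up numbers; the function name `regression_metrics` is illustrative.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (MSE, RMSE, R^2) for a set of predictions."""
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)
    rmse = np.sqrt(mse)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return mse, rmse, r2

# Made-up targets and predictions
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 4.0])

mse, rmse, r2 = regression_metrics(y_true, y_pred)
print(f"MSE: {mse:.3f}, RMSE: {rmse:.3f}, R²: {r2:.3f}")  # MSE: 0.125, RMSE: 0.354, R²: 0.900
```

An R² of 1 means perfect predictions, 0 means the model does no better than always predicting the mean, and negative values mean it does worse than that.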

## Exercises

1. **Learning Rate Experiment**: Try different learning rates (0.001, 0.01, 0.1, 1.0) and observe convergence. What happens when the learning rate is too large?

2. **Polynomial Degree Selection**: For the non-linear data, which polynomial degree gives the best test performance? How do you detect overfitting?

3. **Feature Engineering**: Create new features for the housing dataset (e.g., rooms_per_person = AveRooms / AveOccup, since AveRooms is rooms per household and AveOccup is people per household). Does this improve performance?

4. **Regularization Preview**: Add L2 regularization to prevent overfitting:
   - Modify the loss: $MSE + \lambda \sum w_i^2$
   - Update the gradient: $dw = -(2/m) X^T(y - \hat{y}) + 2\lambda w$

5. **Mini-batch Gradient Descent**: Implement mini-batch GD that updates weights using random subsets of data instead of the full dataset.

6. **Multi-output Regression**: Extend LinearRegression to predict multiple targets simultaneously.

## Next Steps

In the next lab, we'll explore:
- Classification algorithms (Logistic Regression, k-NN, Decision Trees)
- Different types of supervised learning problems
- More complex decision boundaries
- Algorithm comparison and selection

Great job completing Lab 1! You now understand the fundamentals of machine learning and regression.