# Programming Assignment 4 - Simple Linear vs. Ridge Regression


In the historical heart of Boston, Bob seeks to understand the intricacies of the real estate market. With a linear regression model at his side, Bob wonders if he can improve his predictions. Given your expertise in machine learning, he turns to you for guidance. Specifically, he wants to unravel the factors influencing the median value of homes across different Boston neighborhoods.

To assist Bob, you decide to:
* Implement the closed-form solution for linear regression.
* Apply a polynomial transformation to increase model flexibility.
* Utilize ridge regression to control model complexity.
* Apply 10-fold cross-validation for more reliable performance estimates.


Bob is curious and wants to see a comparison between linear and ridge regression, both with and without polynomial transformations, on the same dataset. Thus, the challenge begins!

Variables in order:
* CRIM:     per capita crime rate by town
* ZN:       proportion of residential land zoned for lots over 25,000 sq.ft.
* INDUS:    proportion of non-retail business acres per town
* CHAS:     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
* NOX:      nitric oxides concentration (parts per 10 million)
* RM:       average number of rooms per dwelling
* AGE:      proportion of owner-occupied units built prior to 1940
* DIS:      weighted distances to five Boston employment centres
* RAD:      index of accessibility to radial highways
* TAX:      full-value property-tax rate per \$10,000
* PTRATIO:  pupil-teacher ratio by town
* B:        $1000(Bk - 0.63)^2$ where Bk is the proportion of blacks by town
* LSTAT:    \% lower status of the population
* MEDV:     Median value of owner-occupied homes in \$1000's

Note: The Boston Housing dataset, especially the 'B' variable, touches upon serious ethical and societal concerns related to race and inequality. Reflect upon these issues, and consider strategies such as excluding the 'B' column from analyses.
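One way to act on this note is to drop the `B` column before modelling. The sketch below assumes the feature matrix keeps the column order listed above, so `B` sits at index 11. The helper name `drop_feature` and the toy array are illustrative only and are not part of the assignment template.

```python
import numpy as np

# Column order as listed above (features only, without the MEDV target)
FEATURES = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
            "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]


def drop_feature(X, names, name):
    """Return a copy of X and of the name list without the column `name`."""
    idx = names.index(name)
    kept = [n for n in names if n != name]
    return np.delete(X, idx, axis=1), kept


# Toy stand-in for the real feature matrix: 4 rows, 13 columns
X_toy = np.arange(4 * 13, dtype=float).reshape(4, 13)
X_no_b, names_no_b = drop_feature(X_toy, FEATURES, "B")
print(X_no_b.shape, names_no_b)
```

Applied to the real data, the same call would leave 12 feature columns, and the rest of the analysis would run unchanged on the reduced matrix.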

With this context, let's assist Bob in his real estate endeavors!


## 1 Setup and Data Preparation
Import Libraries



In [1]:
import numpy as np  # Fundamental package for linear algebra and multidimensional arrays
import pandas as pd  # Data analysis and manipulation tool

# Transform features to polynomial features for model flexibility
from sklearn.preprocessing import PolynomialFeatures  

# Split arrays or matrices into random train and test subsets
from sklearn.model_selection import train_test_split  

# Scale features to zero mean and unit variance, commonly used for normalization
from sklearn.preprocessing import StandardScaler

# Provides train/test indices to split data into train/test sets while performing cross-validation
from sklearn.model_selection import KFold  


Load the Dataset


In [2]:
# Define feature names
# Naming the columns of the dataset makes them easier to understand and reference.
feature_names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]

# Load the data
# Despite the .csv extension, the file is whitespace-delimited and has no header row,
# so we pass sep=r'\s+' and header=None. The result is a DataFrame, pandas' 2D labeled data structure.
filename = 'Boston_housing.csv'
df = pd.read_csv(filename, sep=r'\s+', header=None)

# Display basic information about the dataset
# It's good practice to inspect the dataset's size and first few rows to ensure it's loaded correctly and understand its structure.
print("Dataset shape:", df.shape)
print(df.head())

# Extract features and target
# Machine learning typically involves using features (independent variables) to predict a target (dependent variable).
# Here, we separate the dataset into features (X) and target (y).
X = np.array(df.iloc[:, :13])  # The first 13 columns (indices 0-12) are the features
y = np.array(df.iloc[:, 13]).reshape(-1, 1)  # The last column (index 13, MEDV) is the target, reshaped to a 2D column vector.

# Preview data
# It's also good practice to preview the data after separation to ensure everything looks as expected.
print("\nFirst 5 rows of X:\n", X[:5])
print("First 5 values of y:\n", y[:5])
print("X shape:", X.shape)
print("y shape:", y.shape)


Dataset shape: (506, 14)
        0     1     2   3      4      5     6       7   8      9     10  \
0  0.00632  18.0  2.31   0  0.538  6.575  65.2  4.0900   1  296.0  15.3   
1  0.02731   0.0  7.07   0  0.469  6.421  78.9  4.9671   2  242.0  17.8   
2  0.02729   0.0  7.07   0  0.469  7.185  61.1  4.9671   2  242.0  17.8   
3  0.03237   0.0  2.18   0  0.458  6.998  45.8  6.0622   3  222.0  18.7   
4  0.06905   0.0  2.18   0  0.458  7.147  54.2  6.0622   3  222.0  18.7   

       11    12    13  
0  396.90  4.98  24.0  
1  396.90  9.14  21.6  
2  392.83  4.03  34.7  
3  394.63  2.94  33.4  
4  396.90  5.33  36.2  

First 5 rows of X:
 [[6.3200e-03 1.8000e+01 2.3100e+00 0.0000e+00 5.3800e-01 6.5750e+00
  6.5200e+01 4.0900e+00 1.0000e+00 2.9600e+02 1.5300e+01 3.9690e+02
  4.9800e+00]
 [2.7310e-02 0.0000e+00 7.0700e+00 0.0000e+00 4.6900e-01 6.4210e+00
  7.8900e+01 4.9671e+00 2.0000e+00 2.4200e+02 1.7800e+01 3.9690e+02
  9.1400e+00]
 [2.7290e-02 0.0000e+00 7.0700e+00 0.0000e+00 4.6900e-01 7.

Checking for missing values

After getting the data, it's always a good practice to check for missing values in the dataset. Luckily for us, this dataset has no missing values. Here's how you can verify that:


In [3]:
# Count NaN entries in the features and in the target
print("Missing values in X:", np.isnan(X).sum())
print("Missing values in y:", np.isnan(y).sum())

Missing values in X: 0
Missing values in y: 0


## Implementing 10-Fold Cross-Validation
With the data now loaded into X and y, your next task is to implement the code that selects the best regularization strength and polynomial degree. Use 10-fold cross-validation to compare the candidate configurations.



## 10-Fold Cross-Validation with Feature Scaling and Polynomial Transformation

Cross-validation is a method to assess the performance of a machine learning model on unseen data by dividing the data into a set number of groups, or "folds".

### Why 10-Fold Cross-Validation?

In 10-fold cross-validation, the dataset is randomly divided into ten parts or folds. The idea is to iteratively train the model on 9 of these folds and test it on the tenth. This is done ten times, once for each fold acting as the validation set. By doing so, we're ensuring that each data point gets to be in a validation set exactly once.
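The splitting described above can be checked with a short sketch. It uses the same `KFold` settings as the function templates in this notebook (`shuffle=True`, `random_state=10`) and a placeholder array with 506 rows, the size of this dataset. The names `val_counts` and `fold_sizes` are illustrative.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 506  # number of rows in the dataset
kf = KFold(n_splits=10, shuffle=True, random_state=10)

val_counts = np.zeros(n_samples, dtype=int)  # times each row lands in a validation fold
fold_sizes = []
for train_index, val_index in kf.split(np.zeros((n_samples, 1))):
    val_counts[val_index] += 1
    fold_sizes.append(len(val_index))

print("Validation fold sizes:", fold_sizes)
print("Every row validated exactly once:", bool(np.all(val_counts == 1)))
```

Because 506 is not divisible by 10, six folds hold 51 rows and four hold 50, which is why the error calculation later in this notebook divides by the total number of examples instead of averaging per-fold means.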

### Feature Scaling Within Cross-Validation

Feature scaling puts all features on a comparable scale. This matters for ridge regression in particular: the penalty acts on the size of the weights, and the size of each weight depends on the units of its feature, so unscaled features are penalized unevenly.

When doing cross-validation, it's crucial that we don't introduce data leakage by scaling using statistics from the entire dataset. Instead:
1. Divide the data into training and validation sets.
2. Fit the scaler on the training set.
3. Apply the scaling to both the training and validation sets using this scaler.
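The three steps above can be sketched for a single train/validation split as follows. The data are synthetic and the variable names are illustrative; only `StandardScaler` itself comes from the imports at the top of this notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 3))  # synthetic features
X_train, X_val = X_demo[:90], X_demo[90:]                  # one train/validation split

scaler = StandardScaler().fit(X_train)  # statistics come from the training fold only
X_train_s = scaler.transform(X_train)
X_val_s = scaler.transform(X_val)       # reuses the training mean and standard deviation

print("train mean:", X_train_s.mean(axis=0).round(6))
print("val mean:  ", X_val_s.mean(axis=0).round(6))
```

The scaled training fold has zero mean and unit variance by construction. The validation fold is only approximately centered, which is expected: it was never seen when the statistics were computed.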

### Polynomial Transformation Within Cross-Validation

Polynomial transformations capture more intricate data relationships by adding polynomial features. Here's how you incorporate it into cross-validation:
1. Divide the data into training and validation sets.
2. Fit the polynomial transformer on the training set.
3. Transform both the training and validation sets using this transformer.
4. Fit the scaler on the transformed training set.
5. Apply the scaling to both the transformed training and transformed validation sets using this scaler.
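A minimal sketch of this ordering for one split is shown below, on synthetic data. It makes one assumption that the steps above do not spell out: the polynomial expansion is built with `include_bias=False`, and the column of ones is added after scaling, because `StandardScaler` would turn a constant column into zeros. The helper name `prepare` is hypothetical.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(1)
X_demo = rng.uniform(0.0, 10.0, size=(60, 2))  # synthetic data with 2 features
X_train, X_val = X_demo[:50], X_demo[50:]

# Fit the polynomial transformer, then the scaler, on the training fold only
poly = PolynomialFeatures(degree=2, include_bias=False).fit(X_train)
scaler = StandardScaler().fit(poly.transform(X_train))


def prepare(X_part):
    """Expand, scale with training statistics, then prepend a column of ones."""
    Z = scaler.transform(poly.transform(X_part))
    return np.hstack([np.ones((Z.shape[0], 1)), Z])


Z_train, Z_val = prepare(X_train), prepare(X_val)
print(Z_train.shape, Z_val.shape)
```

With two input features and degree 2, the expansion yields five columns (the two features, their squares, and their product), so each prepared matrix has six columns including the intercept.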

---
### Note on Cross-Validation Error Calculation

Most lecture notes and textbooks compute the k-fold cross-validation error as the mean of the per-fold errors. That average gives every fold the same weight, so when the folds differ in size (506 examples split into 10 folds gives folds of 50 and 51 examples), examples in the smaller folds count slightly more than examples in the larger ones.

To avoid this, our calculation deviates slightly from the traditional method. Instead of averaging the per-fold errors, we sum the squared errors across all folds and divide by $ N $, the total number of examples in the dataset. Every example then contributes to the estimate with the same weight, regardless of the size of its fold.

Mathematically, the cross-validation error, $ E_{cv} $, for this assignment is computed as:
$$  E_{\text{cv}} = \frac{1}{N} \sum_{i=1}^{k} \sum_{j \in \text{fold } i} (y^{(j)}- \hat{y}^{(j)})^2
 $$
where $ k $ is the number of folds, $ y^{(j)} $ is the true target value of the $j^{th} $ example, and $ \hat{y}^{(j)} $ is the predicted value for the same example.
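To see how the two estimates differ, the sketch below evaluates both on made-up per-example squared errors for three folds of unequal size. The numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical per-example squared errors for three folds of unequal size
fold_sq_errors = [
    np.array([1.0, 1.0, 1.0, 1.0]),  # 4 examples
    np.array([4.0, 4.0, 4.0]),       # 3 examples
    np.array([9.0, 9.0, 9.0]),       # 3 examples
]

n_total = sum(len(e) for e in fold_sq_errors)

# Pooled estimate used in this assignment: sum everything, divide by N
e_cv_pooled = sum(e.sum() for e in fold_sq_errors) / n_total

# Conventional estimate: average of the per-fold mean squared errors
e_cv_fold_mean = np.mean([e.mean() for e in fold_sq_errors])

print(e_cv_pooled, e_cv_fold_mean)
```

The pooled estimate is (4 + 12 + 27) / 10 = 4.3, while the mean of the fold means is (1 + 4 + 9) / 3 ≈ 4.67. The two agree only when all folds have the same size.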

---


# Your code goes here

Feel free to add any helper functions you may need.

### Part a) 10-fold Cross Validation using Linear Regression

In [4]:
def linear_regression(X, y):
    # use np.linalg.pinv(a)    
    # Compute the weights using the closed-form solution 
    #### TO-DO #####
    w = np.linalg.pinv(X.T @ X) @ X.T @ y
    
    
    ##############
    return w

Next, implement the squared error. It is the sum of the squared differences between the actual values (true values) and the estimated values (predictions). Mathematically, it is represented as: $  \sum_{i=1}^{N} (y^{(i)} - \hat{y}^{(i)})^2 $


In [5]:
def squared_error(y_true, y_pred):
    #### TO-DO #####
    # Sum the squared differences between true and predicted values
    error = np.sum((y_true - y_pred) ** 2)
    ##############
    return error

In [None]:
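def k_fold_linear_regression(X, y, k=10):
    """
    Perform k-fold cross-validation for linear regression.

    Returns (e_in, e_cv): the squared error per example on the
    training folds and on the held-out folds, respectively.
    """
    kf = KFold(n_splits=k, random_state=10, shuffle=True)
    #### TO-DO #####
    train_sse, n_train = 0.0, 0  # running training squared error and example count
    val_sse = 0.0                # running validation squared error

    for train_index, val_index in kf.split(X):
        X_train, X_val = X[train_index], X[val_index]
        y_train, y_val = y[train_index], y[val_index]

        # Fit the scaler on the training fold only to avoid data leakage
        scaler = StandardScaler().fit(X_train)
        X_train = scaler.transform(X_train)
        X_val = scaler.transform(X_val)

        # Prepend a bias column so the model can learn an intercept
        X_train = np.hstack([np.ones((X_train.shape[0], 1)), X_train])
        X_val = np.hstack([np.ones((X_val.shape[0], 1)), X_val])

        # Fit the model on training data
        w = linear_regression(X_train, y_train)

        # Accumulate in-sample and validation squared errors (predictions are X @ w)
        train_sse += np.sum(squared_error(y_train, X_train @ w))
        val_sse += np.sum(squared_error(y_val, X_val @ w))
        n_train += len(train_index)

    # Divide the summed errors by the number of examples they cover
    e_in = train_sse / n_train
    e_cv = val_sse / X.shape[0]
    ##############
    return e_in, e_cv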


In [None]:
#Use your code to answer question a)    
#### TO-DO #####
    
    
##############


### Part b) Adding Ridge Regression
Enhance the previous code to include Ridge Regression.
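As a reminder of what the regularization does, the sketch below solves the standard ridge system $(X^T X + \alpha I)\,w = X^T y$ on synthetic data and prints the norm of the weights for increasing $\alpha$. It is an illustration under stated assumptions, not the required solution: the function name `ridge_closed_form` is hypothetical, and it uses `np.linalg.solve` in place of the `np.linalg.pinv` call suggested in the template.

```python
import numpy as np


def ridge_closed_form(X, y, alpha):
    """Solve (X^T X + alpha * I) w = X^T y for w."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 5))                        # synthetic features
true_w = np.array([[2.0], [-1.0], [0.5], [0.0], [3.0]])  # weights used to generate y
y_demo = X_demo @ true_w + 0.1 * rng.normal(size=(40, 1))

norms = []
for alpha in [0.0, 1.0, 10.0, 100.0]:
    w = ridge_closed_form(X_demo, y_demo, alpha)
    norms.append(float(np.linalg.norm(w)))
    print(f"alpha={alpha:>6}: ||w|| = {norms[-1]:.4f}")
```

With $\alpha = 0$ the result coincides with ordinary least squares, and the weight norm shrinks as $\alpha$ grows. This is how ridge regression limits model complexity.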

In [None]:
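def ridge_regression(X, y, alpha):
    # use np.linalg.pinv(a)
    # Compute the weights using the closed-form solution
    #### TO-DO #####
    # w = (X^T X + alpha * I)^+ X^T y; the penalty applies to every column of X
    w = np.linalg.pinv(X.T @ X + alpha * np.eye(X.shape[1])) @ X.T @ y
    ##############
    return w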

In [None]:


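def k_fold_ridge_regression(X, y, k=10, lambdas=[1e-4, 1e-3, 1e-2, 1e-1, 1.0]):
    """
    Perform k-fold cross-validation for ridge regression with various lambda values.
    """
    best_alpha = None
    best_error = float('inf')

    kf = KFold(n_splits=k, random_state=10, shuffle=True)

    for alpha in lambdas:
        #### TO-DO #####
        val_sse = 0.0  # validation squared error summed over all folds
        for train_index, val_index in kf.split(X):
            # Fit the scaler on the training fold only to avoid data leakage
            scaler = StandardScaler().fit(X[train_index])
            X_train = scaler.transform(X[train_index])
            X_val = scaler.transform(X[val_index])

            # Prepend a bias column so the model can learn an intercept
            X_train = np.hstack([np.ones((X_train.shape[0], 1)), X_train])
            X_val = np.hstack([np.ones((X_val.shape[0], 1)), X_val])

            w = ridge_regression(X_train, y[train_index], alpha)
            val_sse += np.sum(squared_error(y[val_index], X_val @ w))

        # Cross-validation error: summed squared error divided by N
        e_cv = val_sse / X.shape[0]
        if e_cv < best_error:
            best_alpha, best_error = alpha, e_cv
        ##############
    return best_alpha, best_error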


In [None]:
#Use your code to answer question b)    
#### TO-DO #####
    
    
##############

### Part c) Adding Polynomial Transformations and Ridge Regression
Extend your code to incorporate polynomial transformations combined with Ridge Regression.
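Before running the search, it helps to know how quickly the number of columns grows with the polynomial degree. The sketch below counts the columns that `PolynomialFeatures` produces for 13 input features and the degrees used in this part. The placeholder array `X_dummy` is illustrative, since only its column count matters.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features = 13                      # number of input features in this dataset
X_dummy = np.zeros((1, n_features))  # placeholder; only the column count matters

n_columns = {}
for degree in [1, 2, 3]:
    poly = PolynomialFeatures(degree=degree)  # includes the bias column by default
    n_columns[degree] = poly.fit_transform(X_dummy).shape[1]

print(n_columns)  # {1: 14, 2: 105, 3: 560}
```

At degree 3 there are 560 columns but only about 455 training rows per fold, so $X^T X$ is singular without regularization. This is where the ridge penalty and the pseudo-inverse become essential.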

In [None]:
from sklearn.preprocessing import PolynomialFeatures

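def k_fold_poly_ridge(X, y, k=10, lambdas=[1e-4, 1e-3, 1e-2, 1e-1, 1.0, 2, 3, 4, 5], degrees=[1, 2, 3]):
    """
    Perform k-fold cross-validation for ridge regression with various lambda values and polynomial transformations.
    """
    best_lambda = None
    best_degree = None
    best_error = float('inf')

    kf = KFold(n_splits=k, random_state=10, shuffle=True)

    for degree in degrees:
        # The bias column is added after scaling (the scaler would zero out a constant column).
        # The expansion uses no data statistics, so applying it before the split causes no leakage.
        poly = PolynomialFeatures(degree=degree, include_bias=False)
        X_poly = poly.fit_transform(X)
        #### TO-DO #####
        for lam in lambdas:
            val_sse = 0.0  # validation squared error summed over all folds
            for train_index, val_index in kf.split(X_poly):
                # Fit the scaler on the transformed training fold only
                scaler = StandardScaler().fit(X_poly[train_index])
                X_train = scaler.transform(X_poly[train_index])
                X_val = scaler.transform(X_poly[val_index])

                # Prepend a bias column so the model can learn an intercept
                X_train = np.hstack([np.ones((X_train.shape[0], 1)), X_train])
                X_val = np.hstack([np.ones((X_val.shape[0], 1)), X_val])

                w = ridge_regression(X_train, y[train_index], lam)
                val_sse += np.sum(squared_error(y[val_index], X_val @ w))

            # Cross-validation error: summed squared error divided by N
            e_cv = val_sse / X.shape[0]
            if e_cv < best_error:
                best_lambda, best_degree, best_error = lam, degree, e_cv
        ##############
    return best_lambda, best_degree, best_error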


In [None]:
#Use your code to answer question c)    
#### TO-DO #####
    
    
##############

In [None]:
# additional code here