# AI534 Implementation 1
**Deadline**: Sunday, Oct. 12, by 11:59pm

**Submission Instruction**: Submit 1) your completed notebook in ipynb format, and 2) a PDF export of the completed notebook with outputs (the codeblock at the end of the notebook should automatically produce the pdf file).

**Overview** In this assignment, we will implement and experiment with linear regression models to predict house prices based on various features. We will use the same housing data you explored in the warm-up assignment.

We will implement two versions, one using the closed-form solution, and one using gradient descent.

You may modify the starter code as you see fit, including changing the signatures of functions and adding/removing helper functions. However, please make sure that your TA can understand what you are doing and why.

First, let's import the necessary packages and configure the notebook environment.

In [1]:
# Install required packages for PDF export (used at the end of the notebook)
# !pip install nbconvert > /dev/null 2>&1
# !pip install pdfkit > /dev/null 2>&1
# !apt-get install -y wkhtmltopdf > /dev/null 2>&1

# Import system and utility libraries
import os
import pdfkit
import contextlib
import sys
# from google.colab import files

# Import data science libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# add more imports if necessary

# Part 0: (5 pts) Data and preprocessing

---
### Data access
Follow these steps to access the datasets:
1. On Canvas, download the following files:
- `IA1_train.csv` (training data)
- `IA1_val.csv` (validation data)
2. Upload both files to your Google Drive at:
```
/My Drive/AI534/
```
3. Mount Google Drive in Colab using the following code block, which assumes specific file paths for your files.

In [6]:
# from google.colab import drive
# drive.mount('/content/gdrive')

# train_path = '/content/gdrive/My Drive/AI534/IA1_train.csv' # DO NOT MODIFY THIS. Please make sure your data has this exact path
# val_path = '/content/gdrive/My Drive/AI534/IA1_val.csv' # DO NOT MODIFY THIS. Please make sure your data has this exact path

train_path = './IA1_train.csv' # DO NOT MODIFY THIS. Please make sure your data has this exact path
val_path = './IA1_dev.csv' # DO NOT MODIFY THIS. Please make sure your data has this exact path

Now load the training and validation data.

In [12]:
train_df = pd.read_csv(train_path)
val_df = pd.read_csv(val_path)

train_df["date"].dtype

dtype('O')

## 🚧 Preprocessing
Implement the preprocessing function:
1. **Remove** the *ID* column from both training and validation data
2. **Extract date components** Convert the 'date' column into 3 numerical features: 'day', 'month' and 'year'
3. **Create a new feature 'age_since_renovated'** to replace the inconsistent 'yr_renovated'. The 'yr_renovated' column is set to 0 if the house has not been renovated, so its numerical values do not have a consistent meaning. Replace it with a new feature called *age_since_renovated*:

>if *yr\_renovated* != 0
>> *age\_since\_renovated* = *year* - *yr\_renovated*  

>else
>> *age\_since\_renovated* = *year* - *yr\_built*

4. **Normalize features using z-score normalization** (except the target 'price')
For each feature 'x':
$$ z=\frac{x-\mu}{\sigma} $$

where:
 $\mu$ is the mean of 'x' in the training set
 $\sigma$ is the standard deviation of 'x' in the training set

Apply the same $\mu$ and $\sigma$ from the training data to normalize both the training and validation data.




In [None]:
def preprocess(train_df, val_df):
    # Your code goes here
    train_df = train_df.drop(columns="ID")
    val_df = val_df.drop(columns="ID")

    #Process date
    train_df["date"]

    return X_train, X_val, y_train, y_val
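The four preprocessing steps above can be sketched as follows. This is only one possible arrangement, not the reference solution: `preprocess_sketch` is a hypothetical name, the column names (`ID`, `date`, `price`, `yr_built`, `yr_renovated`) are taken from the instructions, and the code assumes that `pd.to_datetime` can infer the date format from the strings in the file. It uses the pandas default (sample) standard deviation, which matches the `X_train.std()` check used in the test cell.

```python
import pandas as pd


def preprocess_sketch(train_df, val_df):
    """Hypothetical preprocessing sketch; column names follow the steps above."""

    def clean(df):
        df = df.drop(columns="ID")
        # Assumption: pandas can infer the date format from the strings.
        date = pd.to_datetime(df.pop("date"))
        df["day"] = date.dt.day
        df["month"] = date.dt.month
        df["year"] = date.dt.year
        # Reference year: renovation year if renovated, else construction year.
        renovated = df["yr_renovated"] != 0
        ref_year = df["yr_renovated"].where(renovated, df["yr_built"])
        df["age_since_renovated"] = df["year"] - ref_year
        return df.drop(columns="yr_renovated")

    train_df, val_df = clean(train_df), clean(val_df)
    y_train, y_val = train_df.pop("price"), val_df.pop("price")

    # Statistics come from the training set only and are reused for validation.
    mu, sigma = train_df.mean(), train_df.std()
    X_train = (train_df - mu) / sigma
    X_val = (val_df - mu) / sigma
    return X_train, X_val, y_train, y_val
```

Because `drop` returns a new DataFrame, the caller's original frames are left unchanged. A feature that is constant in the training set would have $\sigma = 0$ and would need separate handling, which this sketch does not attempt.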

Let's quickly test your normalization. Please:
1. Estimate and print the new mean and standard deviation of the normalized features for the training data --- this should be 0 and 1 respectively.  
2. Estimate and print the new mean and standard deviation of the normalized features for the validation data --- these values will not be 0 and 1, but somewhat close

In [None]:
# Apply preprocessing
X_train, X_val, y_train, y_val = preprocess(train_df, val_df)

# Print training set stats
print("Training set (normalized features):")
print("Mean:", X_train.mean().round(2).to_list())
print("Std: ", X_train.std().round(2).to_list())

# Print validation set stats
print("\nValidation set (normalized features):")
print("Mean:", X_val.mean().round(2).to_list())
print("Std: ", X_val.std().round(2).to_list())



## ✍️ Question
Why is it important to use the same $\mu$ and $\sigma$ to perform normalization on the training and validation data? What would happen if we used $\mu$ and $\sigma$ estimated from the validation data to normalize the validation data?  


**Your answer goes here:**

# Part 1 (10 pts) Generate closed-form solution for reference.

Before we implement gradient descent, we’ll begin by solving linear regression using the **closed-form solution** as a reference point.

Our data now contains 21 numeric features. Including the bias term $w_0$, the learned weight vector should have 22 dimensions.



## 🚧 Implement closed-form solution for linear regression
Write a function to compute the weight vector for linear regression using the **closed-form solution** (also known as the normal equation):
$$
\mathbf{w} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}
$$

You may use NumPy's built-in matrix operations. For numerical stability, we recommend using `np.linalg.pinv()` when computing the inverse.

Your function should take the feature matrix and target vector as input, and return the learned weight vector $\mathbf{w}$.

In [None]:
def closed_form_linear_regression(X, y):
    """
    Compute weights for linear regression using the closed-form solution.

    Args:
        X (ndarray): Feature matrix of shape (n_samples, n_features)
        y (ndarray): Target vector of shape (n_samples,)

    Returns:
        w (ndarray): Weight vector of shape (n_features,)
    """
    # Your code goes here
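As a point of comparison, the normal equation can be written in a few lines of NumPy. This is a minimal sketch rather than the expected solution: `closed_form_sketch` is a hypothetical name, and it assumes the caller has already added a column of ones to `X` for the bias term.

```python
import numpy as np


def closed_form_sketch(X, y):
    """Solve w = pinv(X^T X) X^T y; X is assumed to already contain a bias column."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.linalg.pinv(X.T @ X) @ X.T @ y


# Toy check on noiseless data generated as y = 3 + 2x (not the housing data)
x = np.array([0.0, 1.0, 2.0, 3.0])
X_toy = np.column_stack([np.ones_like(x), x])  # bias column first
w_toy = closed_form_sketch(X_toy, 3.0 + 2.0 * x)
```

On this toy data the recovered weights should match the generating intercept and slope up to floating-point error.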

## 🚧 Apply and evaluate the model
1. Use your `closed_form_linear_regression()` function to learn weights from the **training data**.
2. Use the learned weights to make predictions on both **training** and **validation** sets.
3. Report the **Mean Squared Error (MSE)** for both sets.
4. Print the learned weight vector (should have 22 values: 21 features + bias).

In [None]:
# Your Code goes here
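Two small helpers are often convenient for this step. The names `add_bias` and `mse` are hypothetical, and placing the bias column first (so that `w[0]` is $w_0$) is an assumption of this sketch, not a requirement of the assignment.

```python
import numpy as np


def add_bias(X):
    """Prepend a column of ones so that w[0] plays the role of the bias w_0."""
    X = np.asarray(X, dtype=float)
    return np.column_stack([np.ones(X.shape[0]), X])


def mse(X, y, w):
    """Mean squared error of the linear predictions X @ w."""
    residual = np.asarray(X, dtype=float) @ w - np.asarray(y, dtype=float)
    return float(np.mean(residual ** 2))
```

With these, the training and validation errors would be computed by calling `mse` on the bias-augmented feature matrices with the same learned `w`.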

## ✍️ Question
The learned feature weights are often used to understand the importance of the features. The sign of a weight indicates whether a feature positively or negatively impacts the price, and the magnitude suggests the strength of the impact. Do the signs of all the feature weights match your expectation based on your common-sense understanding of what makes a house expensive? Please highlight any surprises from the results.


**Your answer goes here**

# Part 2 (35 pts) Implement and experiment with batch gradient descent

In this part, you will implement batch gradient descent for linear regression and experiment with it on the given data.

## 🚧 Implement 'batch_gradient_descent' function

Your function should take the following **inputs:**
- `X` : training feature matrix (shape: n_samples × d)
- `y` : target vector (shape: n_samples)
- `gamma` : learning rate $\gamma$
- `T` : number of iterations (epochs)
- `epsilon_loss` *(optional)*: convergence threshold for loss $\epsilon_l$
- `epsilon_grad` *(optional)*: convergence threshold for gradient norm $\epsilon_g$

It should output:
1. 'w': the learned $(d+1)$-dimensional weight vector
2. 'losses': list of mean squared errors, one per training iteration

In [None]:
def batch_gradient_descent(X, y, gamma, T, epsilon_loss=None, epsilon_grad=None):
    """
    Perform batch gradient descent for linear regression.

    Args:
        X (ndarray): Feature matrix (n_samples, n_features)
        y (ndarray): Target vector (n_samples,)
        gamma (float): Learning rate
        T (int): Number of iterations (epochs)
        epsilon_loss (float, optional): Convergence threshold for loss
        epsilon_grad (float, optional): Convergence threshold for gradient norm

    Returns:
        w (ndarray): Learned weight vector (d+1, includes bias)
        losses (list): MSE loss at each epoch
    """
    # Your code goes here
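One way the interface above could be filled in is sketched below. Several choices here are assumptions rather than requirements: the loss is $\frac{1}{N}\lVert \mathbf{X}\mathbf{w}-\mathbf{y}\rVert^2$ with gradient $\frac{2}{N}\mathbf{X}^T(\mathbf{X}\mathbf{w}-\mathbf{y})$ (your lecture notes may use a different constant), the weights start at zero, the bias column is added inside the function, and the loss threshold is applied to the change in loss between consecutive iterations. The name `batch_gd_sketch` is hypothetical.

```python
import numpy as np


def batch_gd_sketch(X, y, gamma, T, epsilon_loss=None, epsilon_grad=None):
    """Hypothetical batch gradient descent on the MSE loss; adds the bias column itself."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    Xb = np.column_stack([np.ones(n), X])  # bias column first, so w[0] is w_0
    w = np.zeros(Xb.shape[1])
    losses = []
    for _ in range(T):
        residual = Xb @ w - y
        loss = float(np.mean(residual ** 2))
        if not np.isfinite(loss):
            break  # diverged: the learning rate is too large
        losses.append(loss)
        grad = (2.0 / n) * (Xb.T @ residual)  # gradient of the mean squared error
        if epsilon_grad is not None and np.linalg.norm(grad) < epsilon_grad:
            break
        if (epsilon_loss is not None and len(losses) > 1
                and abs(losses[-2] - loss) < epsilon_loss):
            break
        w = w - gamma * grad
    return w, losses
```

Each recorded loss is the MSE of the weights before that iteration's update, so `losses[0]` is the loss at the zero initialization.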

## 🚧 Experiment with different learning rates
Use your 'batch_gradient_descent' function to:
1. Train models on the training data with learning rates $\gamma = 10^{-i}$ for $i = 0, 1, 2, 3, 4$.
2. Train for up to 3000 iterations (stop early if the loss converges or diverges).
3. For each converging (not necessarily converged yet) learning rate, compute and report the final MSE on the **validation set**.
4. Plot the **training loss curves** (MSE vs. iterations) for all converging learning rates.
   - Use different colors for each learning rate
   - Include a legend

In [None]:
# Your code goes here
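The sweep over learning rates can be organized as a loop that records each loss curve and flags diverging runs. The sketch below runs on toy data, not the housing data, and makes its own assumptions: `gd_loss_curve` is a hypothetical helper, the feature matrix already includes a bias column, and a run is declared diverged once the loss exceeds an arbitrary multiple (`blowup`) of the initial loss.

```python
import numpy as np


def gd_loss_curve(X, y, gamma, T, blowup=1e6):
    """Batch GD on MSE; X is assumed to include a bias column already.

    Returns (losses, diverged). `blowup` is an arbitrary, hypothetical
    threshold: the run is flagged once the loss exceeds blowup * initial loss.
    """
    w = np.zeros(X.shape[1])
    losses = []
    for _ in range(T):
        residual = X @ w - y
        loss = float(np.mean(residual ** 2))
        if not np.isfinite(loss) or (losses and loss > blowup * losses[0]):
            return losses, True
        losses.append(loss)
        w = w - gamma * (2.0 / len(y)) * (X.T @ residual)
    return losses, False


# Toy data: y = 1 + 2x with x spread over [-2, 2]
x = np.linspace(-2.0, 2.0, 101)
X_toy = np.column_stack([np.ones_like(x), x])
y_toy = 1.0 + 2.0 * x

results = []
for i in range(5):
    gamma = 10.0 ** (-i)
    losses, diverged = gd_loss_curve(X_toy, y_toy, gamma, T=3000)
    results.append((gamma, losses, diverged))

for gamma, losses, diverged in results:
    status = "diverged" if diverged else f"final training MSE = {losses[-1]:.4g}"
    print(f"gamma = {gamma:g}: {status}")
```

The stored `losses` lists for the non-diverging runs are what you would pass to `plt.plot`, one call per learning rate with a `label`, followed by `plt.legend()`.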

## ✍️ Question

Which learning rate leads to the best training MSE, and which leads to the best validation MSE? Do you observe that a better training MSE tends to correspond to a better validation MSE? How is this different from the trend shown on page 52 (or vicinity) of the lecture slides (titled 'danger of using training loss to select M') regarding overfitting? Is there any issue with using training loss to pick the learning rate in this case?

**Your answer goes here.**

# Part 3. More exploration.

## **3(a). (25 pts) Normalization of features: what is the impact?**
In Part 0, you were asked to perform z-score normalization of all the features. In this part, we ask you to first think conceptually about the impact of this operation on the solution, and then use some experiments to verify your conceptual understanding.

## ✍️ **Questions.**

The normalization process applies a linear transformation to each feature, where the transformed feature $x'$ is simply a linear function of original feature $x$: $x'=\frac{x-\mu}{\sigma}$.

Let's dissect the influence of this transformation on our learned linear regression model.
1. How do you think this transformation will influence the training and validation MSE we get for the closed-form solution? Why?
2. How do you think this will change the magnitude of the weights of the learned model? Why?
3. How do you think this will change the convergence behavior of the batch gradient descent algorithm? Why?

**Your answer goes here.**

## 🚧 Experimental verification
Now please perform the following experiments to verify your answers to the above questions.
1. Apply 'closed_form_linear_regression' to training data that did not go through the feature normalization step, and report the learned weights and the resulting training and validation MSEs.

2. Apply 'batch_gradient_descent' to training data that did not go through the feature normalization step, using different learning rates. Note that the learning rates used in the previous section will no longer work here. You will need to search for an appropriate learning rate to get converging behavior. Once you identify a convergent learning rate, plot your MSE loss curve as a function of the epochs.

   Hint: the learning rate needs to be much, much, much, much, much, much, much smaller (think of each "much" as an order of magnitude) than what was used in Part 2. Also, unless you let it run for a long time, it is unlikely to converge to the same level of loss values, so use a reasonable upper bound on the number of iterations so that it won't take forever.

In [None]:
# Your code goes here
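A rough way to narrow the search for a workable learning rate is to look at the curvature of the loss. For the quadratic loss $\frac{1}{N}\lVert \mathbf{X}\mathbf{w}-\mathbf{y}\rVert^2$, the Hessian is $\frac{2}{N}\mathbf{X}^T\mathbf{X}$, and gradient descent with a constant step diverges once $\gamma > 2/\lambda_{\max}$, where $\lambda_{\max}$ is the largest eigenvalue of that Hessian. The sketch below assumes this loss scaling and uses made-up numbers on the scale of square footage rather than the housing data; `max_stable_learning_rate` is a hypothetical helper.

```python
import numpy as np


def max_stable_learning_rate(X):
    """Upper bound 2 / lambda_max on the step size for GD on the MSE loss.

    X is assumed to include the bias column. The Hessian of
    (1/N) * ||Xw - y||^2 is (2/N) * X^T X.
    """
    X = np.asarray(X, dtype=float)
    hessian = (2.0 / X.shape[0]) * (X.T @ X)
    lambda_max = np.linalg.eigvalsh(hessian)[-1]  # eigvalsh sorts ascending
    return 2.0 / lambda_max


# Toy illustration: one feature on the scale of square footage,
# before and after z-scoring.
rng = np.random.default_rng(0)
sqft = rng.uniform(500.0, 5000.0, size=200)
ones = np.ones_like(sqft)
raw = np.column_stack([ones, sqft])
scaled = np.column_stack([ones, (sqft - sqft.mean()) / sqft.std()])

bound_raw = max_stable_learning_rate(raw)
bound_scaled = max_stable_learning_rate(scaled)
print("raw features:    gamma <", bound_raw)
print("scaled features: gamma <", bound_scaled)
```

The bound only says where divergence starts. A learning rate below it still converges slowly along low-curvature directions, which is consistent with the hint about needing many iterations.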

## ✍️ Questions

Please revisit the questions above. Do your experiments confirm your expectations? Can you explain the observed differences (or lack of differences) between the normalized and unnormalized data? Based on these observations and your understanding of them, please comment on the benefits of normalizing the input features when learning a linear regression model.


**Your answer goes here**

## **3(b). (15 pts) Explore the impact of correlated features**

In the warm-up exercise, you saw that some features are highly correlated with one another. For example, there are multiple square-footage-related features that are strongly correlated (e.g., *sqft_above* and *sqft_living* have a correlation coefficient of 0.878). This phenomenon, where two or more features are correlated, is referred to as multicollinearity.

There are numerous consequences from multicollinearity. It makes it more challenging to estimate the weights of the features accurately. The weights may become unstable, and their interpretation becomes less clear.

In this part you will work with the pre-processed training set, and perform the following experiments to examine how correlated features affect the stability of learned weights.

## 🚧 Experiment to investigate impact of correlated features
Conduct the following experiments.
1. **Create five training subsets**:  Randomly subsample 75% of the original preprocessed training set to form five slightly different training sets.
2. **Fit models**:  Use your 'closed_form_linear_regression' function to train a linear regression model on each of the five training sets.
3. **Report learned weights in a table**:  
   - The table should have **five rows** (one for each model)  
   - Each column corresponds to a **feature’s weight**  
   - Include a **header row** with the feature names

4. **Report the variance of weights across models**:  
   Add a row to the above table that reports, for each feature, the variance of its learned weight across the five models. This variance serves as a measure of the **stability** of the weight assigned to each feature. Larger variance suggests lower stability.
  
  Note: We use 5 random training subsets here to get a rough sense of weight stability. For a more robust analysis, you could increase this to 10 or more runs.



In [None]:
# Your code goes here
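The subsampling experiment can be organized as a single function that returns the requested table. The sketch below is illustrative only: `weight_stability_table` is a hypothetical helper, subsets are drawn without replacement (the instructions do not specify), the bias column is added inside the function, and the demonstration uses synthetic data with one nearly duplicated feature instead of the housing data.

```python
import numpy as np
import pandas as pd


def weight_stability_table(X, y, n_runs=5, frac=0.75, seed=0):
    """Fit the closed-form solution on random subsets and tabulate the weights.

    X is a DataFrame of features without a bias column; y is array-like.
    Subsets are drawn without replacement (an assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    n = len(X)
    rows = []
    for _ in range(n_runs):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        Xb = np.column_stack([np.ones(len(idx)), X.iloc[idx].to_numpy(dtype=float)])
        rows.append(np.linalg.pinv(Xb.T @ Xb) @ Xb.T @ y[idx])
    table = pd.DataFrame(rows, columns=["bias", *X.columns],
                         index=[f"model_{k + 1}" for k in range(n_runs)])
    table.loc["variance"] = table.var()  # sample variance (ddof=1) per column
    return table


# Synthetic demonstration: "a_copy" is almost a duplicate of "a"
rng = np.random.default_rng(1)
a = rng.normal(size=400)
toy = pd.DataFrame({
    "a": a,
    "a_copy": a + 0.01 * rng.normal(size=400),
    "c": rng.normal(size=400),
})
target = 2.0 * toy["a"] + 1.0 * toy["c"] + 0.5 * rng.normal(size=400)
table = weight_stability_table(toy, target)
print(table.round(3))
```

The variance row is computed from the five model rows before it is appended, so it does not include itself.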

## ✍️ Questions
Ideally, we want the learned weight coefficients to be **stable across different runs**, as this indicates a more **reliable and interpretable** model.
- Based on the variances you computed:
  - Do features with **high correlation to others** tend to show **more instability** in their weights across different training subsets?
  - What trends do you observe?
- Use a **correlation matrix** of the input features to support your observations. Which features appear most correlated?
- What implications does this have for interpreting feature importance in your model?
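To read off the most correlated pairs without scanning the full matrix by eye, the correlation matrix can be flattened and sorted. This is a small sketch with a hypothetical helper name, `top_correlated_pairs`. It uses the Pearson correlation computed by `DataFrame.corr()` and ranks pairs by absolute value.

```python
import numpy as np
import pandas as pd


def top_correlated_pairs(X, top=5):
    """Return the feature pairs with the largest absolute Pearson correlation."""
    corr = X.corr().abs()
    # Keep only the strict upper triangle so each pair appears once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(ascending=False).head(top)


# Tiny made-up example: "b" is roughly twice "a", "c" is unrelated
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [5, 1, 4, 2, 3],
})
pairs = top_correlated_pairs(toy)
print(pairs)
```

The result is a Series indexed by feature pairs, which can be compared directly against the variance row of the weight table.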


**Your answer goes here.**

# Kaggle competition (10 pts)
In this section, you will try to build your best model on the given training data and apply it to the provided test data and submit the predictions for the class-wide competition on Kaggle.

**Model restriction.** You must use linear regression (without regularization) as your predictive model. Advanced models, such as Ridge, Lasso, tree-based models, neural networks, or other complex learners, are not allowed.

**Implementation note.** For this part, you are allowed to use a standard library implementation (e.g., 'sklearn.linear_model.LinearRegression') to speed up experimentation.

**Exploration encouraged.** You are encouraged to explore:
- feature engineering such as removing, transforming features, constructing new features based on existing ones, using different encoding for the discrete features;
- training data filtering/modification such as identifying and removing potential outliers in the training data;
- target manipulation such as normalizing, or log transforming the prediction target
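As an illustration of the target-manipulation idea, the sketch below fits ordinary least squares to $\log(1 + \text{price})$ and maps predictions back to the price scale. The function names are hypothetical, the feature matrix is assumed to include a bias column, and the toy data is constructed so that the log price is exactly linear in the feature. Note that fitting in log space minimizes squared error on the log price, not on the price itself.

```python
import numpy as np


def fit_log_target(X, price):
    """Least-squares fit of log(1 + price) on X (X is assumed to include a bias column)."""
    w, *_ = np.linalg.lstsq(X, np.log1p(price), rcond=None)
    return w


def predict_price(X, w):
    """Map predictions from log space back to the original price scale."""
    return np.expm1(X @ w)


# Toy data where log(1 + price) = 10 + 0.5 * x exactly
x = np.array([0.0, 1.0, 2.0, 3.0])
X_toy = np.column_stack([np.ones_like(x), x])
price_toy = np.expm1(10.0 + 0.5 * x)

w_log = fit_log_target(X_toy, price_toy)
recovered = predict_price(X_toy, w_log)
```

Whether this helps on the real data is an empirical question that the validation set can answer.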

**Fair play and have fun!** The spirit of this competition is for you to learn how far linear regression can go when paired with thoughtful data preparation.

To participate in this competition, use the following link:
https://www.kaggle.com/t/7e07d14f327c4ee1babd526d4ccf0701


**Team work.** You should continue working in the same team for this competition. Make sure to note your Kaggle team name in your submission.

**How to submit.**
Your submission should include the prediction for every test sample. The file must be a CSV with two columns: `id` and `price`.
- `id` is the unique identifier for each instance as provided in the test data PA1_test1.csv  
- `price` is your predicted result.
Your file should start with a header row (`id, price`), followed by $N$ rows, one per test sample.
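A submission file in this two-column layout can be produced with pandas. The helper name `write_submission` and the default file name are hypothetical. `index=False` keeps pandas from adding an extra index column.

```python
import pandas as pd


def write_submission(ids, predictions, path="submission.csv"):
    """Write a two-column CSV with header `id,price` (hypothetical helper)."""
    submission = pd.DataFrame({"id": list(ids), "price": list(predictions)})
    submission.to_csv(path, index=False)
    return submission
```

The `ids` argument would come from the identifier column of the test file, in the same order as the predictions.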


**Competition evaluation.** The competition has two leaderboards: a public leaderboard and a private leaderboard. The results on the public leaderboard are visible throughout the competition, so you can tell how well your model works compared to others and use it to pick the best models to submit for the private leaderboard. Each team will be allowed to submit 3 entries to be evaluated on the private leaderboard for the final performance. The results on the private leaderboard will be released after the competition is closed.

**Points and bonus points.** You will get the full 10 points if you
- participate in the competition (successful submissions)

- achieve non-trivial performance (outperform some simple baseline)

- complete the report on the competition below.

You will get **3 bonus points** if your team scores in the top 3 on the private leaderboard, or enters the largest number of unique submissions (unique scores).

**No late submission.** The competition will be closed at 11:59 pm of the due date. No late submission will be allowed for this portion of the assignment to ensure fairness.


## ✍️ Report on the Kaggle competition

1. **Team name**:
2. **Exploration Summary:** Briefly describe the approaches you tried.
3. **Most Impactful Change:** Which exploration led to the largest performance improvement, and why do you think it helped?


In [None]:
# Running this code block will convert this notebook and its outputs into a PDF report.
# ⚠️ALERT! Exporting Colab notebooks into a clean, figure-inclusive PDF can be unreliable.
# Sometimes output figures may not appear in your exported file.
#
# If this happens, please assemble your report manually: copy relevant figures/results
# into a separate document and save it as a PDF. Be sure to clearly label each figure with
# the corresponding part number (e.g., Part 3(b)).

!jupyter nbconvert --to html /content/gdrive/MyDrive/Colab\ Notebooks/IA1-2025.ipynb  # you might need to change this path to the location of your copy of the IA1 notebook

input_html = '/content/gdrive/MyDrive/Colab Notebooks/IA1-2025.html' #you might need to change this path accordingly
output_pdf = '/content/gdrive/MyDrive/Colab Notebooks/IA1output.pdf' #you might need to change this path or name accordingly

# Convert HTML to PDF
pdfkit.from_file(input_html, output_pdf)

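# Download the generated PDF (Colab only; the import is commented out at the top of the notebook)
from google.colab import files
files.download(output_pdf)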