# Homework 1 for CS 329P

**Authors**: xiaoxiao

**Emails**: myname@xiaoxiao.com

**Submission.** Please insert your names and emails above, save your code in this notebook, and explain what you are doing along with your findings in text cells. You can think of it as a technical report with code. Before submission, please use `Kernel -> Restart & Run All` in the Jupyter menu to verify your code is runnable and save all outputs. Afterwards, you can either upload your raw notebook (`hw1.ipynb`) or an exported PDF version to the `Homework 1` assignment in Canvas.


In this homework, we will train a house sales price predictor on the data we scraped previously. The purpose of this homework is to let you practice different techniques that you can use to preprocess raw data. Your job is to obtain the best root mean squared logarithmic error (RMSLE) on the test dataset. To make your job easy, we provide sample code to train a model to report RMSLE and a list of ideas you can explore.

**Note**: You can complete this assignment using either a local runtime or a hosted runtime (with GPU) on Colab. The second option generally runs faster. If using a local runtime, make sure that your Python version is at least 3.6 and less than 3.9, or you may have issues installing AutoGluon. If using a runtime hosted on Colab, you can use the File Explorer pane on the left to upload the `house_sales.ftr` file. Make sure to wait until the file finishes uploading before running the next code block.

Additionally, if using a local runtime, please refer to the [AutoGluon documentation](https://auto.gluon.ai/stable/index.html#installation) for instructions on how to install AutoGluon.

## Prepare Data

Let's first read in the dataset we used in our [Exploratory Data Analysis (EDA)](https://c.d2l.ai/stanford-cs329p/_static/notebooks/cs329p_notebook_eda.slides.html). Note that we use the [`feather` format](https://arrow.apache.org/docs/python/feather.html), which is faster to read than CSV but uses more disk space. The file `house_sales.ftr` can be downloaded from the Assignments folder in Canvas.

Just for your information, it is generated with:

```python
data = pd.read_csv('house_sales.zip', dtype='unicode')
data.to_feather('house_sales.ftr')
```

The following code needs at least 2GB memory. If using a local runtime, please make sure your machine has enough memory.


In [1]:
!pip install numpy pandas autogluon mxnet --upgrade

Collecting numpy
  Using cached numpy-2.1.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)


In [3]:
# Run the following line once to install. You may need to restart your runtime afterwards:
# !pip3 install numpy pandas autogluon mxnet --upgrade
import pandas as pd
import numpy as np
import pyarrow
print(pyarrow.__version__)
data = pd.read_feather('./house_sales2.ftr')

17.0.0


In [4]:
data.describe()

| | Id | Address | Sold Price | Sold On | Summary | Type | Year built | Heating | Cooling | Parking | ... | Well Disclosure | remodeled | DOH2 | SerialX | Full Baths | Tax Legal Lot Number | Tax Legal Block Number | Tax Legal Tract Number | Building Name | Zip |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 164944 | 164944 | 164859 | 164944 | 161827 | 163260 | 163289 | 163266 | 163263 | 163260 | ... | 1 | 1.0 | 1 | 1 | 1 | 1.0 | 1.0 | 1 | 1 | 164944 |
| unique | 164944 | 161952 | 11784 | 1101 | 159689 | 317 | 182 | 3284 | 1044 | 12214 | ... | 1 | 1.0 | 1 | 1 | 1 | 1.0 | 1.0 | 1 | 1 | 1762 |
| top | 2080183300 | Zzzz, | $1,200,000 | 02/26/21 | For comp purposes only. | SingleFamily | No Data | No Data | No Data | Garage, Garage - Attached, Covered | ... | Yes | 2020.0 | TBD | 0903-521-14085B | One | 39.0 | 62033.0 | Piru-0057 | Iron Horse South | 95003 |
| freq | 1 | 97 | 1149 | 1979 | 46 | 102040 | 14224 | 37341 | 52958 | 28165 | ... | 1 | 1.0 | 1 | 1 | 1 | 1.0 | 1.0 | 1 | 1 | 795 |


In [5]:
import scipy
import numpy as np
scipy.__version__, np.__version__

('1.12.0', '1.26.4')

We select a few common columns to make our training fast. You need to select more columns to make your model more accurate.

In [6]:
df = data[['Sold Price', 'Sold On', 'Type', 'Year built', 'Bedrooms', 'Bathrooms']].copy()
# uncomment the below line to save memory
# del data
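
One way to decide which additional columns are worth adding is to check how much of each column is actually populated. The sketch below uses a small made-up frame (`toy`, a hypothetical stand-in for `data`), and the 50% cutoff is an arbitrary assumption rather than a recommended value.

```python
import pandas as pd

# Toy stand-in for `data`; the values are made up for illustration.
toy = pd.DataFrame({
    'Sold Price': ['$1,200,000', '$850,000', None, '$430,000'],
    'Bedrooms': ['3', '2', '4', None],
    'Well Disclosure': [None, None, None, 'Yes'],
})

# Fraction of missing values in each column.
null_frac = toy.isnull().mean()

# Keep columns that are missing in at most half of the rows (arbitrary cutoff).
candidates = null_frac[null_frac <= 0.5].index.tolist()
print(null_frac)
print(candidates)
```

Running the same two lines on `data` would give a quick shortlist of columns that are populated often enough to be worth cleaning.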

We copy the code from EDA to convert `Sold Price` to numerical values, which is our prediction target. We also remove examples whose prices are too high or too low.

In [7]:
c = 'Sold Price'
print(f"Before: {df.shape}")
if c in df.select_dtypes('object').columns:
    df.loc[:,c] = np.log10(
            pd.to_numeric(df[c].replace(r'[$,-]', '', regex=True)) + 1)
    # First convert the price strings to numbers and take the log.
    # With log prices, the filter below removes houses priced above 10^8 or below 10^4.
df = df[(df['Sold Price'] >= 4 ) & (df['Sold Price'] <= 8 )]
print(f"After: {df.shape}")



Before: (164944, 6)
After: (160839, 6)
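
To make the thresholds concrete, here is a small worked example of the same transformation on three made-up price strings (the `prices` series is hypothetical). It shows that the bounds 4 and 8 on the log10 scale correspond to roughly $10,000 and $100,000,000.

```python
import numpy as np
import pandas as pd

# Made-up prices: one typical, one too low, one too high.
prices = pd.Series(['$1,200,000', '$9,500', '$250,000,000'])

# Same steps as the cell above: strip '$', ',' and '-', convert, take log10(price + 1).
numeric = pd.to_numeric(prices.replace(r'[$,-]', '', regex=True))
log_prices = np.log10(numeric + 1)

# Keep only rows whose log price lies in [4, 8].
kept = log_prices[(log_prices >= 4) & (log_prices <= 8)]
print(log_prices.round(3).tolist())
print(kept.index.tolist())
```

Only the first price survives the filter; the other two fall just outside the [4, 8] range.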


We use the house sales between 2021-2-15 and 2021-3-1 as our test data. You can use any example before 2021-2-15, but not after. In other words, we pretend we are launching our model on 2021-2-15 and testing it for 2 weeks. Here we only use sales in 2021 for fast training, but you can use more to improve accuracy.

In [8]:
test_start, test_end = pd.Timestamp(2021, 2, 15), pd.Timestamp(2021, 3, 1)
train_start = pd.Timestamp(2021, 1, 1)
df['Sold On'] = pd.to_datetime(df['Sold On'], errors='coerce')
train = df[(df['Sold On'] >= train_start) & (df['Sold On'] < test_start)]
test = df[(df['Sold On'] >= test_start) & (df['Sold On'] < test_end)]
print(train.shape, test.shape)

(24872, 6) (11510, 6)



Define our evaluation metric.

In [9]:
def rmsle(y_hat, y):
    # we already used log prices before, so we only need to compute RMSE
    return sum((y_hat - y)**2 / len(y))**0.5
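
Because the target is already `log10(price + 1)`, the function above is an RMSE in log space, that is, a base-10 RMSLE. The sketch below is an equivalent NumPy formulation (`rmsle_np` is a hypothetical name, not part of the provided code) together with a one-example sanity check that helps interpret the metric.

```python
import numpy as np

def rmsle_np(y_hat, y):
    # RMSE in log space; inputs are assumed to already be log10(price + 1).
    y_hat = np.asarray(y_hat, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sqrt(np.mean((y_hat - y) ** 2)))

# Predicting $1,000,000 for a house that actually sold for $2,000,000.
y_true = np.log10(np.array([2_000_000.0]) + 1)
y_pred = np.log10(np.array([1_000_000.0]) + 1)
print(rmsle_np(y_pred, y_true))  # about log10(2), i.e. roughly 0.301
```

In other words, an error of about 0.3 on this scale corresponds to predictions that are off by a factor of about 2 in price.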

## AutoGluon Baseline

We provide a baseline model trained by AutoGluon (AG). AG is an AutoML tool that performs automatic feature engineering, model selection, and ensembling. You are welcome to use any model and tool to achieve the best possible results in your homework. However, we recommend that you reuse the following training code so that you can focus on data preprocessing.

In [10]:
from autogluon.tabular import TabularPredictor

label = 'Sold Price'
predictor = TabularPredictor(label=label).fit(train)

No path specified. Models will be saved in: "AutogluonModels/ag-20240922_035838"
Verbosity: 2 (Standard Logging)
AutoGluon Version:  1.1.1
Python Version:     3.10.12
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP PREEMPT_DYNAMIC Thu Jun 27 21:05:47 UTC 2024
CPU Count:          2
Memory Avail:       8.54 GB / 12.67 GB (67.4%)
Disk Space Avail:   70.17 GB / 112.64 GB (62.3%)
No presets specified! To achieve strong results with AutoGluon, it is recommended to use the available presets.
	Recommended Presets (For more details refer to https://auto.gluon.ai/stable/tutorials/tabular/tabular-essentials.html#presets):
	presets='best_quality'   : Maximize accuracy. Default time_limit=3600.
	presets='high_quality'   : Strong accuracy with fast inference speed. Default time_limit=3600.
	presets='good_quality'   : Good accuracy with very fast inference speed. Default time_limit=3600.
	presets='medium_quality' : Fast training time, ideal for initial prototyping.

Test the performance of each model.

In [11]:
predictor.leaderboard(test, silent=True)


| | model | score_test | score_val | eval_metric | pred_time_test | pred_time_val | fit_time | pred_time_test_marginal | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LightGBMXT | -0.264573 | -0.287632 | root_mean_squared_error | 1.002475 | 0.265963 | 4.641302 | 1.002475 | 0.265963 | 4.641302 | 1 | True | 3 |
| 1 | NeuralNetFastAI | -0.267545 | -0.290652 | root_mean_squared_error | 0.220935 | 0.038024 | 25.598358 | 0.220935 | 0.038024 | 25.598358 | 1 | True | 8 |
| 2 | LightGBMLarge | -0.268721 | -0.288314 | root_mean_squared_error | 0.194245 | 0.034438 | 1.346797 | 0.194245 | 0.034438 | 1.346797 | 1 | True | 11 |
| 3 | LightGBM | -0.270316 | -0.288493 | root_mean_squared_error | 0.618771 | 0.15753 | 2.2378 | 0.618771 | 0.15753 | 2.2378 | 1 | True | 4 |
| 4 | NeuralNetTorch | -0.275444 | -0.289182 | root_mean_squared_error | 0.0661 | 0.242862 | 101.686823 | 0.0661 | 0.242862 | 101.686823 | 1 | True | 10 |
| 5 | ExtraTreesMSE | -0.281582 | -0.304336 | root_mean_squared_error | 0.785299 | 0.24085 | 16.427803 | 0.785299 | 0.24085 | 16.427803 | 1 | True | 7 |
| 6 | WeightedEnsemble_L2 | -0.283026 | -0.2844 | root_mean_squared_error | 1.214507 | 0.547308 | 160.154313 | 0.003376 | 0.000449 | 0.017483 | 2 | True | 12 |
| 7 | CatBoost | -0.304812 | -0.285888 | root_mean_squared_error | 0.180149 | 0.022505 | 37.681603 | 0.180149 | 0.022505 | 37.681603 | 1 | True | 6 |
| 8 | XGBoost | -0.340308 | -0.287954 | root_mean_squared_error | 0.179584 | 0.040642 | 4.340601 | 0.179584 | 0.040642 | 4.340601 | 1 | True | 9 |
| 9 | RandomForestMSE | -0.349453 | -0.307901 | root_mean_squared_error | 1.776107 | 0.268424 | 29.063563 | 1.776107 | 0.268424 | 29.063563 | 1 | True | 5 |



Next, we compute the importance of each feature, along with several other metrics. It looks like the `Sold On` feature is not very useful, likely because every house in the test data was sold after all of the training examples, so the model never saw those dates during training. You can choose to either remove such a feature or find a way to extract a more useful representation from it.

In [12]:
predictor.feature_importance(test)

Computing feature importance via permutation shuffling for 5 features using 5000 rows with 5 shuffle sets...
	35.43s	= Expected runtime (7.09s per shuffle set)
	20.44s	= Actual runtime (Completed 5 of 5 shuffle sets)


| feature | importance | stddev | p_value | n | p99_high | p99_low |
|---|---|---|---|---|---|---|
| Bathrooms | 0.08046 | 0.003255 | 3.205901e-07 | 5 | 0.087161 | 0.073758 |
| Type | 0.079314 | 0.005775 | 3.348176e-06 | 5 | 0.091204 | 0.067424 |
| Year built | 0.061498 | 0.002366 | 2.62567e-07 | 5 | 0.066371 | 0.056626 |
| Bedrooms | 0.013502 | 0.000738 | 1.068519e-06 | 5 | 0.015022 | 0.011982 |
| Sold On | 0.000284 | 0.000446 | 0.1140565 | 5 | 0.001202 | -0.000635 |
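
As one possible way to extract a different representation from `Sold On`, the sketch below replaces the raw timestamp with a few numeric features. The helper `add_date_features`, the new column names, and the reference date are all assumptions made for illustration; whether any of these features actually improves the score is something you would need to check on the leaderboard.

```python
import pandas as pd

# Toy stand-in for the `Sold On` column, including one missing date.
toy = pd.DataFrame({'Sold On': pd.to_datetime(['2021-01-05', '2021-02-20', None])})

def add_date_features(frame, col='Sold On', ref=pd.Timestamp(2021, 1, 1)):
    # Hypothetical helper: derive numeric features from a datetime column.
    out = frame.copy()
    out['sold_month'] = out[col].dt.month            # 1 to 12
    out['sold_day_of_week'] = out[col].dt.dayofweek  # Monday=0 ... Sunday=6
    out['sold_days_since_ref'] = (out[col] - ref).dt.days
    # Drop the raw timestamp so only the derived features remain.
    return out.drop(columns=[col])

feats = add_date_features(toy)
print(feats)
```

Missing dates simply propagate as `NaN` in the derived columns. Note that a "days since reference" feature in the test period still lies outside the range seen in training, so cyclical features such as the day of week may behave differently from it.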


Finally, let's predict and evaluate the RMSLE.

In [13]:
preds = predictor.predict(test.drop(columns=[label]))
rmsle(preds, test[label])

0.28302552009742815

## Your Solution

Please include your solution in the following section. (You are welcome to edit and delete code in previous sections).

Your goal is to train a model using the features in the original dataset that minimizes the RMSLE on the test dataset. While the naïve model achieves an RMSLE of ~0.3, it is possible to achieve an RMSLE of less than 0.08 on the same dataset.

Here is a list of ideas you could explore:

- More features: We only selected a small set of columns to use in training. You can add more, especially the ones we examined in EDA.
- Data type conversion: Most data columns are strings; you may need to convert them into numerical values.
- Data cleaning: There are NaN values and outliers sprinkled throughout the dataset. You should find ways to selectively filter or remove them.
- More examples: We only included sales made in 2021; there is a large number of examples in previous years that you can also include.
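
Two of the ideas above, data type conversion and data cleaning, can be sketched on a toy frame as follows. The values are made up, and the plausible range for `Year built` (1800 to 2021) is an arbitrary assumption; treating implausible values as missing rather than dropping the whole row is only one of several reasonable choices.

```python
import pandas as pd

# Toy stand-in for two string columns of the dataset (values are made up).
toy = pd.DataFrame({
    'Bathrooms': ['2.0', '3', 'No Data', None],
    'Year built': ['1969', '2005', '0', '9999'],
})

# Data type conversion: anything that is not a number becomes NaN.
for c in ['Bathrooms', 'Year built']:
    toy[c] = pd.to_numeric(toy[c], errors='coerce')

# Data cleaning: keep plausible years, mark the rest as missing.
plausible = toy['Year built'].between(1800, 2021)
toy['Year built'] = toy['Year built'].where(plausible)

print(toy)
print(toy.dtypes)
```

After this step both columns are numeric, with `NaN` marking entries that were missing, non-numeric, or outside the assumed range.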

In [None]:
# YOUR SOLUTION HERE

FIN