# Case 1 (Part I): House price prediction

In this case (Part I), you will build a multilayer perceptron network to predict the selling price of properties. The dataset consists of all single-family houses and condos that were sold in Denver in a given year.

You need to submit the following files on the Canvas site:

- A report in PDF format containing the plots of the training errors for the multilayer perceptron model and the linear regression model, and the answers to the two questions below. You should also provide interpretations and implications of each plot/table in your report. It is not enough to simply put a chart or a table of numbers in the report and expect the audience to understand what the chart means and what it implies. The point is to provide some insights for an audience like a senior manager at Zillow.

- The complete Jupyter notebook containing all your PyTorch code with explanations, along with Markdown text explaining the different parts if needed.




---
## Kaggle community competition: Prof. X's Prize


You need to set up a Kaggle account and join the Kaggle competition by following the [link](https://www.kaggle.com/t/414a77c12150407d97e39fae245e34ef).

- Name your team as Section_X_Team_Y, where X is either A or B or C or D, and Y is your team number.
- One of the team members can serve as team leader and invite other members of your team to join the team.

- Each team can submit at most 20 predictions daily.

To get the test error for your model, you need to submit your predicted prices for the test data on Kaggle. See the Kaggle competition website for more detailed instructions. Note that in Part I of the case, you do not need to worry about optimizing your model to get the lowest error possible. Part I will be graded based on your implementation of the base models as specified below. We will come back to optimize the model and compete for Prof. X's Prize in Part II of the case.

---
## Data Loading and Visualization


The train data and test data are available on the Kaggle competition website.
You need to first download them, then upload them to Google Colab, and then read the data using pandas.

In [None]:
import pandas as pd  # Importing pandas, which is a library for data manipulation and analysis
# Read the datasets
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

### Visualization of SALE PRICES in train data

Let's take a closer look at the sale prices in the train data.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt  # Importing matplotlib's pyplot for making plots and charts

# Set the style
sns.set(style="whitegrid")

# Create a histogram
plt.figure(figsize=(10, 6))
sns.histplot(train_df['SALE_PRICE'], bins=50, color='blue')
plt.title('Histogram of Sale Prices (Train Data)')
plt.xlabel('Sale Price')
plt.ylabel('Number of Properties')
plt.show()

---
## Data Preparation

The first step when building a neural network model is getting your data into the proper form to feed into the network.

- **Train labels**: We need to extract the sale prices from the train data as train labels. Since the house prices can take very large values, to make training fast it is helpful to define the train labels as the sale prices divided by a normalization factor.

- **Handling non-numeric features**: Some of the house features are non-numeric. We will learn how to process categorical data in the upcoming lectures. For now, you can remove those non-numeric features and train only on the numeric features.

- **Feature standardization**: The house features span very different ranges: some take small floating-point values, while others take fairly large integer values. The model might be able to adapt to such heterogeneous data automatically, but it would make learning more difficult. A widespread best practice for dealing with such data is feature-wise normalization: for each feature in the input data (a column in the input dataframe), we subtract the mean of the feature and divide by its standard deviation, so that the feature is centered around 0 and has unit standard deviation. **Note**: We need to ensure that the train and test data go through the same normalization.

- **Handling missing values**: There may exist some entries with missing values. After the feature standardization, we can impute the missing values with zeros.

We see that the sale_price in train data has a wide range from 50K to 2 million, with the median price 431K. We can divide the sale_price by 100K, so the normalized sale_price is between 0.5 and 20 in training data. Remember, when we output the predicted price for the test data, we need to multiply back the normalization factor.
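
The preparation steps above can be sketched on a tiny made-up dataframe. This is only an illustration under stated assumptions, not the required solution: every column name except `SALE_PRICE` is hypothetical, and reusing the train mean and standard deviation for the test data is one common way (not the only way) to make both sets go through the same normalization.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for train.csv / test.csv. Column names other than
# SALE_PRICE are hypothetical and chosen only for illustration.
toy_train = pd.DataFrame({
    "SALE_PRICE": [250000.0, 431000.0, 900000.0],
    "AREA": [1200.0, np.nan, 3000.0],
    "BEDROOMS": [2, 3, 5],
    "NEIGHBORHOOD": ["A", "B", "A"],  # non-numeric, dropped below
})
toy_test = pd.DataFrame({
    "AREA": [1500.0, 2100.0],
    "BEDROOMS": [3, np.nan],
    "NEIGHBORHOOD": ["B", "A"],
})

# Labels: sale prices divided by the normalization factor.
normalization_factor = 100000
toy_labels = toy_train["SALE_PRICE"] / normalization_factor

# Keep numeric columns only and exclude the label from the features.
toy_train_x = toy_train.drop(columns=["SALE_PRICE"]).select_dtypes(include="number")
toy_test_x = toy_test[toy_train_x.columns]

# Assumption: statistics are computed on the train data and reused for test.
mean = toy_train_x.mean()
std = toy_train_x.std()

# Standardize, then impute missing values with zero (the post-scaling mean).
toy_train_x = ((toy_train_x - mean) / std).fillna(0.0)
toy_test_x = ((toy_test_x - mean) / std).fillna(0.0)
```

Because imputation happens after standardization, filling with zero amounts to replacing a missing entry with that feature's mean.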

In [None]:
#TODO: define labels for train data as the sale prices divided by $100,000
normalization_factor=100000
train_labels =

In [None]:
train_labels.shape

In [None]:
#TODO: Write code to construct feature vectors for train and test data after data preparation.
train_features =
test_features =

In [None]:
train_features.shape, test_features.shape

Finally, we convert features and labels to PyTorch tensors.

In [None]:
import torch
import numpy as np

# Convert training features and labels to PyTorch tensors
train_features = torch.tensor(train_features.values.astype(np.float32), dtype=torch.float32)
test_features = torch.tensor(test_features.values.astype(np.float32), dtype=torch.float32)
train_labels = torch.tensor(train_labels.values.reshape(-1, 1).astype(np.float32), dtype=torch.float32)

---
## DataLoaders and Batching

After creating the training and test data, we can create DataLoaders for this data by following two steps:
1. Create a known format for accessing our data, using [TensorDataset](https://pytorch.org/docs/stable/data.html#torch.utils.data.TensorDataset), which takes in an input set of data and a target set of data with the same first dimension, and creates a dataset.
2. Create DataLoaders to batch our training and test Tensor datasets. Note that we will shuffle the train data, so the model will not learn a particular order. For test data, we do not shuffle, so the predictions stay in the same order as the test rows.

In [None]:
from torch.utils.data import TensorDataset, DataLoader
#  Create DataLoaders and batch our train data
train_data = TensorDataset(train_features, train_labels)
train_loader = DataLoader(train_data, batch_size=128, shuffle=True)

In [None]:
#TODO: Create DataLoaders and batch for test data
test_loader =

Let's take one batch as a sanity check.

In [None]:
# obtain one batch of training data
dataiter = iter(train_loader)
features, labels = next(dataiter)

print('Sample input size: ', features.size()) # batch_size, num_features
print('Sample input: \n', features)
print()
print('Sample label size: ', labels.size()) # batch_size, 1
print('Sample label: \n', labels)

---
## Linear regression as benchmark

Let us first build a linear regression model as a benchmark.

In [None]:
#TODO: Build a linear regression model network
lin_net =

Let's take a batch and check the output shape.

In [None]:
features, labels = next(dataiter)
output = lin_net(features)
output.shape, labels.shape

## Train the model

First, we will use GPU training if it is available.
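
A common way to do this in PyTorch is sketched below. The `nn.Linear(3, 1)` model and the random batch are placeholders for illustration only; the key point is that the model and every batch must be moved to the same device.

```python
import torch
from torch import nn

# Pick the GPU when one is visible to PyTorch, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholder model and batch; both must live on the same device.
model = nn.Linear(3, 1).to(device)
batch = torch.randn(2, 3).to(device)
output = model(batch)
```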

In [None]:
#TODO: use GPU for training if it is available

Second, let us specify the loss function.

In [None]:
#TODO: specify the loss function for training
criterion =

We are now ready to train the network.


Note that with house prices, as with stock prices, we care about relative quantities more than absolute quantities. Thus we tend to care more about the relative error than about the absolute error. For instance, if our prediction is off by \\$100,000 when estimating the sale price of a house which is \\$125,000, then we are probably doing a horrible job. On the other hand, if we err by this amount for a house with sale price \\$2 million, this might represent a pretty accurate prediction.

To this end, we will use the median error rate (MER) used by [Zestimate](https://www.zillow.com/z/zestimate/) to measure the predictive performance. The error rate is defined as
$$
\text{Error Rate} = \left| \frac{\text{Predicted Price}-\text{Actual Price}}{\text{Actual Price}} \right|
$$
The median error rate is defined as the median of error rates for all properties.
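
The definition above can be written as a short PyTorch helper. This is a minimal sketch: the function name `median_error_rate` and the toy tensors are made up for illustration. Note that the normalization factor cancels in the ratio, so the MER is the same whether it is computed on normalized labels or on prices in dollars.

```python
import torch

def median_error_rate(predicted: torch.Tensor, actual: torch.Tensor) -> float:
    """Median of |predicted - actual| / actual over all properties."""
    error_rates = torch.abs((predicted - actual) / actual)
    # Note: for an even number of entries, torch.median returns the lower
    # of the two middle values rather than their average.
    return torch.median(error_rates).item()

# Toy predictions and labels in units of $100,000 (shape: batch_size, 1).
predicted = torch.tensor([[1.5], [1.8], [3.3]])
actual = torch.tensor([[1.0], [2.0], [3.0]])

# Error rates are roughly 0.5, 0.1, and 0.1, so the median is about 0.1.
mer = median_error_rate(predicted, actual)
```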

In [None]:
#TODO: Write code to train the network

Plot the training error (MER) over epochs

In [None]:
#TODO: Write code to plot the training error (MER) over epochs

---
## Build the Multi-layer Perceptron Base Model

In the following, we build a multilayer perceptron model.
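
One way to express such an architecture (two hidden layers of sizes 256 and 128 with ReLU activations) is with `nn.Sequential`, sketched below. The helper name `build_mlp` and the input width of 10 are assumptions for illustration; in the notebook the input width would be the number of columns in `train_features`.

```python
import torch
from torch import nn

def build_mlp(num_features: int) -> nn.Sequential:
    """MLP with hidden layers of 256 and 128 units and a single price output."""
    return nn.Sequential(
        nn.Linear(num_features, 256),
        nn.ReLU(),
        nn.Linear(256, 128),
        nn.ReLU(),
        nn.Linear(128, 1),  # no activation: the output is a regression target
    )

# Hypothetical input width, used only to check the shapes.
mlp_net = build_mlp(num_features=10)
dummy_batch = torch.randn(4, 10)
output = mlp_net(dummy_batch)  # shape: (4, 1)
```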

In [None]:
#TODO: Build a multilayer perceptron neural network with 2 hidden layers of sizes 256 and 128, respectively, and ReLU activations

In [None]:
#TODO: write code to train the MLP network

In [None]:
#TODO: Write code to plot the training error (MER) over epochs

**Question 1**: What are your final training errors of the multilayer perceptron model and the linear regression model?

---
## Inference on test data

After the MLP model is trained, we can use it for inference.
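
The general inference pattern is sketched below under simplifying assumptions: an untrained `nn.Linear` stands in for the trained MLP, and random tensors stand in for the real test features. The points to carry over are switching to evaluation mode, disabling gradient tracking, keeping the test order (no shuffling), and multiplying the outputs back by the normalization factor.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

normalization_factor = 100000

# Stand-ins for illustration only: an untrained model and random features.
model = nn.Linear(3, 1)
toy_test_features = torch.randn(5, 3)
toy_loader = DataLoader(TensorDataset(toy_test_features), batch_size=2, shuffle=False)

model.eval()  # evaluation mode (matters for layers such as dropout)
batch_predictions = []
with torch.no_grad():  # no gradients are needed for inference
    for (features,) in toy_loader:
        # Undo the label scaling so predictions are in dollars again.
        batch_predictions.append(model(features) * normalization_factor)

# One predicted price per test row, in the original row order.
predicted_prices = torch.cat(batch_predictions).squeeze(1)
```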

In [None]:
#TODO: write the code to generate predicted sale prices for test data

In [None]:
#TODO: save the predicted sale prices into submission_csv

Now, we can submit our predictions on Kaggle and see how they compare with the actual house prices (labels) on the test set.

- Log in to the Kaggle website and visit the house price prediction competition page.

- Click the “Submit Predictions” button.

- Click the “Browse Files” button in the dashed box at the bottom of the page and select the prediction file you wish to upload.

- Click the “Submit” button at the bottom of the page to view your results.

**Question 2**: What is the test error shown on Kaggle? How does it compare with the train error?