# Homework 1

**Your name:** Arianna Bunnell

**A-Number:** A02213719

In this homework, we will build a model based on real house sale data from a [Kaggle competition](https://www.kaggle.com/c/house-prices-advanced-regression-techniques). This notebook contains code to download the dataset, build and train a baseline model, and save the results in the submission format. Your tasks:

1.   Implement the preprocessing code

2.   Develop a better model to reduce the prediction error. You can try any models you know.

3.   Submit your results to Kaggle and take a screenshot of your score. Then replace the following image URL with your screenshot.

![](score.png)

4.   Submit the .IPYNB file to Canvas.
    - Missing the output after execution may hurt your grade.

## Accessing and Reading Data Sets

The competition data is separated into training and test sets. Each record includes the property values of the house and attributes such as street type, year of construction, roof type, and basement condition. The data includes multiple datatypes, including integers (year of construction), discrete labels (roof type), and floating point numbers. Some data is missing and is thus labeled 'na'. The price of each house, namely the label, is only included in the training data set (it's a competition after all). The 'Data' tab on the competition page has links to download the data.

We will read and process the data using `pandas`, an [efficient data analysis toolkit](http://pandas.pydata.org/pandas-docs/stable/). Make sure you have `pandas` installed for the experiments in this section.

In [2]:
# If pandas is not installed, please uncomment and run the following line:
 #!pip install pandas

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd

We download the data into the current directory. To load the two CSV (comma-separated values) files containing the training and test data, respectively, we use `pandas`.

In [2]:
!wget https://raw.githubusercontent.com/d2l-ai/data/master/kaggle_house_pred_test.csv
!wget https://raw.githubusercontent.com/d2l-ai/data/master/kaggle_house_pred_train.csv

--2020-09-05 12:32:41--  https://raw.githubusercontent.com/d2l-ai/data/master/kaggle_house_pred_test.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.196.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.196.133|:443... connected.
ERROR: cannot verify raw.githubusercontent.com's certificate, issued by 'CN=DigiCert SHA2 High Assurance Server CA,OU=www.digicert.com,O=DigiCert Inc,C=US':
  Unable to locally verify the issuer's authority.
To connect to raw.githubusercontent.com insecurely, use `--no-check-certificate'.
--2020-09-05 12:32:41--  https://raw.githubusercontent.com/d2l-ai/data/master/kaggle_house_pred_train.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.196.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.196.133|:443... connected.
ERROR: cannot verify raw.githubusercontent.com's certificate, issued by 'CN=DigiCert SHA2 High Assurance Server CA,OU=www.digic

In [4]:
train_data = pd.read_csv('kaggle_house_pred_train.csv')
test_data = pd.read_csv('kaggle_house_pred_test.csv')

The training data set includes 1,460 examples, 80 features, and 1 label; the test data contains 1,459 examples and 80 features.

In [5]:
print(train_data.shape)
print(test_data.shape)

(1460, 81)
(1459, 80)


Let’s take a look at the first 4 and last 2 features as well as the label (SalePrice) from the first 4 examples:

In [6]:
train_data.iloc[0:4, [0, 1, 2, 3, -3, -2, -1]]

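|   | Id | MSSubClass | MSZoning | LotFrontage | SaleType | SaleCondition | SalePrice |
|---|----|------------|----------|-------------|----------|---------------|-----------|
| 0 | 1  | 60         | RL       | 65.0        | WD       | Normal        | 208500    |
| 1 | 2  | 20         | RL       | 80.0        | WD       | Normal        | 181500    |
| 2 | 3  | 60         | RL       | 68.0        | WD       | Normal        | 223500    |
| 3 | 4  | 70         | RL       | 60.0        | WD       | Abnorml       | 140000    |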


We can see that in each example, the first feature is the ID. This helps the model identify each training example. While this is convenient, it doesn't carry any information for prediction purposes. Hence we remove it from the dataset before feeding the data into the network.

In [81]:
all_features = pd.concat((train_data.iloc[:, 1:-1], test_data.iloc[:, 1:]))

## Data Preprocessing

As stated above, we have a wide variety of datatypes. Before we feed it into a deep network we need to perform some amount of processing. Let's start with the numerical features. We begin by replacing missing values with the mean. This is a reasonable strategy if features are missing at random. To adjust them to a common scale we rescale them to zero mean and unit variance. This is accomplished as follows:

$$x \leftarrow \frac{x - \mu}{\sigma}$$

To check that this transforms $x$ into data with zero mean and unit variance, first compute $\mathbf{E}[(x-\mu)/\sigma] = (\mu - \mu)/\sigma = 0$. For the variance, we use $\mathbf{E}[(x-\mu)^2] = \sigma^2$, so $\mathbf{E}[((x-\mu)/\sigma)^2] = \sigma^2/\sigma^2 = 1$ and the transformed variable has unit variance. The reason for 'normalizing' the data is that it brings all features to the same order of magnitude. After all, we do not know *a priori* which features are likely to be relevant, so it makes sense to treat them equally.
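As a small illustration of the formula above, the following sketch standardizes a toy column and then fills its missing entry with 0. The column values are made up for the example and are not taken from the competition data. Because the standardized column has zero mean, filling with 0 amounts to imputing the original mean.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value (numbers are illustrative only).
toy = pd.DataFrame({"LotArea": [8450.0, 9600.0, np.nan, 11250.0, 14260.0]})

# Standardize: pandas skips NaN when computing mean() and std().
standardized = (toy - toy.mean()) / toy.std()

# The standardized column has zero mean, so replacing NaN with 0
# is equivalent to imputing the mean of the original column.
filled = standardized.fillna(0)

print(filled["LotArea"].mean())       # approximately 0
print(standardized["LotArea"].std())  # approximately 1
```

Note that `std()` in pandas uses the sample standard deviation (`ddof=1`) by default, which is also what the notebook's normalization relies on.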

In [82]:
# First select all of the columns with an integer or float dtype
numerical_cols = all_features.select_dtypes(include=['int64', 'float64'])

# However, the following columns are actually class labels, so we exclude them from normalization
numerical_cols = numerical_cols.drop(['MSSubClass', 'OverallQual', 'OverallCond', 'MoSold'], axis=1)

# Now apply the normalization only to the true numerical columns
all_features = all_features.apply(lambda x: (x-x.mean())/x.std() if x.name in numerical_cols.columns else x)

In [83]:
# after standardizing the data all means vanish, hence we can set missing values to 0
all_features = all_features.fillna(0)

Next we deal with discrete values. This includes variables such as 'MSZoning'. We replace them by a one-hot encoding in the same manner as how we transformed multiclass classification data into a vector of $0$ and $1$. For instance, 'MSZoning' assumes the values 'RL' and 'RM'. They map into vectors $(1,0)$ and $(0,1)$ respectively. Pandas does this automatically for us.
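A minimal sketch of this encoding on a toy frame (the rows are made up for illustration) shows how `pd.get_dummies` replaces a categorical column with one indicator column per value:

```python
import pandas as pd

# Toy data: one categorical column and one numerical column.
toy = pd.DataFrame({
    "MSZoning": ["RL", "RM", "RL"],
    "LotFrontage": [65.0, 80.0, 68.0],
})

# Encode only the listed columns; other columns pass through unchanged.
encoded = pd.get_dummies(toy, columns=["MSZoning"])

print(encoded.columns.tolist())
print(encoded)
```

Each row has exactly one active indicator among the `MSZoning_*` columns. Depending on the pandas version, the indicators may be stored as booleans or as small integers.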

In [84]:
# Identify the categorical columns: all columns that were not among the
# numerical columns normalized above. We also do not want to transform the index.
non_categorical = list(numerical_cols.columns)
all_features.index.name = 'Index'
non_categorical.append(all_features.index.name)
categorical_cols = np.setdiff1d(list(all_features.columns), non_categorical)

all_features = pd.get_dummies(data=all_features, columns=categorical_cols)


In [86]:
all_features.shape

(2919, 354)

You can see that this conversion increases the number of features from 79 to 354. Finally, via the `values` attribute we can extract the NumPy arrays from the pandas DataFrame.

In [87]:
n_train = train_data.shape[0]
train_features = all_features[:n_train].values
test_features = all_features[n_train:].values
train_labels = train_data.SalePrice.values.reshape((-1, 1))

## Training

To get started we train a linear model with squared loss. This will obviously not lead to a competition-winning submission, but it provides a sanity check to see whether there's meaningful information in the data. It also serves as a minimum baseline for how well we should expect any 'fancy' model to work.

In [115]:
from sklearn.linear_model import LinearRegression, SGDRegressor, RidgeCV
from sklearn.tree import DecisionTreeRegressor

In [120]:
reg = LinearRegression()
reg.fit(train_features, train_labels)
reg.score(train_features, train_labels)

0.9391177350313022

In [124]:
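# My attempt at a better model: a decision tree regressor.
# Note: score() below is R^2 on the training data, not on held-out data.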
regD = DecisionTreeRegressor()
regD.fit(train_features, train_labels.ravel())
regD.score(train_features, train_labels)

1.0

##  Predict and Submit

Now that we know what a good choice of hyperparameters should be, we might as well use all the data to train on it (rather than just the $1-1/k$ fraction of the data that is used in each cross-validation fold). The model that we obtain in this way can then be applied to the test set. Saving the estimates in a CSV file will simplify uploading the results to Kaggle.
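The scores reported earlier are computed on the same data the models were trained on, so a fully grown decision tree can reach an $R^2$ of 1.0 simply by memorizing the training set. A minimal sketch of $k$-fold cross-validation, assuming scikit-learn's `cross_val_score` and synthetic data (the arrays `X` and `y` are made up and only stand in for `train_features` and `train_labels`), might look like this:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data: a linear signal plus noise (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(scale=0.5, size=200)

models = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(random_state=0),
}

results = {}
for name, model in models.items():
    # R^2 on the data the model was fitted on.
    train_r2 = model.fit(X, y).score(X, y)
    # Mean R^2 over 5 folds; each fold is scored on data held out from fitting.
    cv_r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    results[name] = (train_r2, cv_r2)
    print(f"{name}: train R^2 = {train_r2:.3f}, 5-fold CV R^2 = {cv_r2:.3f}")
```

Comparing the held-out scores rather than the training scores gives a fairer picture of which model is likely to do better on the Kaggle test set.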

In [125]:
def train_and_pred(test_features, test_data):
    # Predict with the decision tree fitted above (regD is a global variable)
    preds = regD.predict(test_features)
    # Reformat the predictions for export to Kaggle: one row per Id with its SalePrice
    test_data['SalePrice'] = preds.ravel()
    submission = test_data[['Id', 'SalePrice']]
    submission.to_csv('submission.csv', index=False)

In [126]:
train_and_pred(test_features, test_data)

The code above generates a file, `submission.csv` (CSV is one of the file formats accepted by Kaggle). Next, we can submit our predictions to Kaggle, which compares them to the actual house prices (labels) on the test data set and reports the error. The steps are quite simple:

* Log in to the Kaggle website and visit the House Price Prediction Competition page.
* Click the “Submit Predictions” or “Late Submission” button on the right.
* Click the “Upload Submission File” button in the dashed box at the bottom of the page and select the prediction file you wish to upload.
* Click the “Make Submission” button at the bottom of the page to view your results.

![](submit.png)
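Before uploading, it can help to check the file locally. The sketch below uses a hypothetical helper, `check_submission`, which is not part of the notebook, together with a made-up three-row submission. It assumes the expected layout is exactly the two columns `Id` and `SalePrice` written by the code above.

```python
import os
import tempfile

import pandas as pd


def check_submission(path, expected_ids):
    """Hypothetical helper: basic sanity checks on a submission CSV."""
    sub = pd.read_csv(path)
    # Exactly the two columns written by train_and_pred, in that order.
    assert list(sub.columns) == ["Id", "SalePrice"]
    # One row per expected test example, in the same order.
    assert sub["Id"].tolist() == list(expected_ids)
    # No missing or non-positive predictions.
    assert sub["SalePrice"].notna().all()
    assert (sub["SalePrice"] > 0).all()
    return sub


# Made-up ids and prices, written to a temporary file for the demonstration.
toy_ids = [1461, 1462, 1463]
toy = pd.DataFrame({"Id": toy_ids, "SalePrice": [120000.0, 155000.0, 180000.0]})
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "submission.csv")
    toy.to_csv(path, index=False)
    checked = check_submission(path, toy_ids)

print(checked.shape)
```

For the real file, the call would be along the lines of `check_submission('submission.csv', test_data['Id'])`.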