# House Prices - Advanced Regression Techniques
This notebook is a solution to the [House Prices - Advanced Regression Techniques](https://www.kaggle.com/c/house-prices-advanced-regression-techniques) competition. The goal of the competition is to predict the final price of each home given a set of features. The metric used to evaluate the model is the Root Mean Squared Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price.
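
As a rough illustration of that metric, the sketch below computes the RMSE between log prices with NumPy. The function name `rmsle` and the sample prices are made up for this example. The competition's own scoring code is not part of this notebook and may differ in detail.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """RMSE between log(y_pred) and log(y_true); assumes strictly positive prices."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))

# Hypothetical prices, for illustration only.
observed = [200_000, 150_000, 320_000]
predicted = [210_000, 140_000, 300_000]
score = rmsle(observed, predicted)
print(score)
```

Because the error is taken on logarithms, predicting 10% too high costs the same on a cheap house as on an expensive one.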

## Dependencies
The following dependencies are required to run this notebook:
- numpy
- pandas
- matplotlib
- seaborn
- kaggle
- scikit-learn
- tensorflow

We install them into the current environment with the following command:

In [None]:
%conda install numpy pandas matplotlib seaborn scikit-learn tensorflow-gpu

## Data
The data is provided by the competition and can be downloaded from the competition's [data page](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data). The data is split into two files:
- train.csv: the training set
- test.csv: the test set

This section details the steps taken to preprocess the data and prepare it for training.

### Kaggle Data
Here we download and unzip the data from Kaggle using the following commands:

In [None]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = '..'
data_dir = '../data/house-prices'
!chmod 600 ../kaggle.json
!kaggle competitions download -c house-prices-advanced-regression-techniques -p {data_dir}
!unzip -o {data_dir}/house-prices-advanced-regression-techniques.zip -d {data_dir}

Here we import all of our relevant libraries.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Normalization, InputLayer, Dropout
from tensorflow.keras.regularizers import l2

### Data Description

In [None]:
data = pd.read_csv(f'{data_dir}/train.csv')
data.head()

In [None]:
data.describe()

### Cleaning and Preprocessing

In [None]:
data.drop(columns=['Id', 'Utilities', 'LotShape', 'MSSubClass', 'LandContour', 'LotConfig', 'LandSlope'], inplace=True)

In [None]:
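def scale_features(col):
    """Standardize a numeric column to zero mean and unit variance."""
    if pd.api.types.is_numeric_dtype(col):
        mean = col.mean()
        std = col.std()
        # A constant column has std == 0 and becomes NaN here; clean_features fills it with 0.
        return (col - mean) / std
    return col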

In [None]:
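def clean_features(df):
    """Encode non-numeric columns as integer codes, standardize, and fill gaps with 0 (modifies df in place)."""
    for col in df.columns:
        if not pd.api.types.is_numeric_dtype(df[col]):
            # pd.factorize returns (codes, uniques); missing values get code -1.
            df[col], _ = pd.factorize(df[col])
        df[col] = scale_features(df[col])
    return df.fillna(0)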
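
`clean_features` recomputes the category codes and the mean and standard deviation on whatever frame it receives, so the training, validation, and test sets can end up encoded and scaled differently. One possible alternative, which is not the approach used in this notebook, is to learn the encoding and statistics on the training frame and reuse them elsewhere. The sketch below assumes that design; `fit_preprocessor`, `apply_preprocessor`, and the toy frames are hypothetical names and data introduced only for illustration.

```python
import pandas as pd

def fit_preprocessor(df):
    """Learn category codes and scaling statistics from the training frame only."""
    params = {}
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            categories = None
            codes = df[col]
        else:
            # Remember the categories seen in training, in order of appearance.
            categories = pd.Index(df[col].dropna().unique())
            codes = pd.Series(categories.get_indexer(df[col]), index=df.index)
        mean, std = codes.mean(), codes.std()
        if not std > 0:
            std = 1.0  # constant or single-row column: avoid dividing by zero/NaN
        params[col] = (categories, mean, std)
    return params

def apply_preprocessor(df, params):
    """Encode and scale any frame with the statistics learned by fit_preprocessor."""
    out = pd.DataFrame(index=df.index)
    for col, (categories, mean, std) in params.items():
        values = df[col]
        if categories is not None:
            # Missing or unseen categories map to -1.
            values = pd.Series(categories.get_indexer(values), index=df.index)
        out[col] = (values - mean) / std
    return out.fillna(0)

# Toy frames, for illustration only.
train_df = pd.DataFrame({'Area': [1000, 1500, 2000], 'Street': ['Pave', 'Grvl', 'Pave']})
valid_df = pd.DataFrame({'Area': [1500], 'Street': ['Grvl']})

params = fit_preprocessor(train_df)
train_X = apply_preprocessor(train_df, params)
valid_X = apply_preprocessor(valid_df, params)
print(valid_X)
```

With this split between fitting and applying, `'Grvl'` receives the same scaled value in both frames, whereas factorizing each frame independently would assign codes by order of appearance within that frame.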

In [None]:
train, valid = train_test_split(data, test_size=0.2, random_state=42)

In [None]:
features = train.drop('SalePrice', axis=1)
labels = train['SalePrice']

In [None]:
features = clean_features(features)
features.head()

In [None]:
features.info()

### Visualization

In [None]:
# 'Id' was already dropped above, so average every scaled feature per row.
avg_features = features.mean(axis=1)
plt.scatter(avg_features, labels)
plt.xlabel('Mean Scaled Feature Value')
plt.ylabel('Sale Price')
plt.title('Sale Price vs Features')
plt.show()

## Model

In [None]:
model = Sequential(
    [
        InputLayer(input_shape=(features.shape[1],)),
        Dense(64, activation='relu', name='hidden_layer_1', kernel_regularizer=l2(0.01)),
        Dropout(0.2),
        Dense(64, activation='relu', name='hidden_layer_2', kernel_regularizer=l2(0.01)),
        Dropout(0.2),
        Dense(1, activation='relu', name='layer_3'),
    ]
)
model.summary()

In [None]:
lr_schedule = tf.keras.optimizers.schedules.InverseTimeDecay(
    0.001,
    decay_steps=features.shape[0] / 58 * 1000,
    decay_rate=1,
    staircase=False
)
optimizer = tf.keras.optimizers.Adam(lr_schedule)
model.compile(optimizer=optimizer, loss='mean_squared_error', metrics=['mean_squared_error'])
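
To make the schedule above more concrete, the following sketch evaluates the continuous (`staircase=False`) inverse time decay formula, `initial_lr / (1 + decay_rate * step / decay_steps)`, in plain Python. The helper `inverse_time_decay` is a hypothetical stand-in written for this example, not the TensorFlow class, and the row count of 1168 is an assumption (an 80% split of a 1460-row `train.csv`).

```python
def inverse_time_decay(step, initial_lr=0.001, decay_steps=1000.0, decay_rate=1.0):
    """Continuous (staircase=False) inverse time decay of the learning rate."""
    return initial_lr / (1.0 + decay_rate * step / decay_steps)

# Assumption: 1168 training rows (an 80% split of a 1460-row train.csv).
n_train = 1168
decay_steps = n_train / 58 * 1000

for step in (0, 10_000, decay_steps, 100_000):
    lr = inverse_time_decay(step, decay_steps=decay_steps)
    print(f'step {step:>10.0f}: lr = {lr:.6f}')
```

With `decay_rate=1`, the learning rate halves once the optimizer step count reaches `decay_steps`, which under the assumption above is about 20,138 steps, or roughly 1,000 epochs at 20 steps per epoch.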


### Validation


In [None]:
valid_features = valid.drop('SalePrice', axis=1)
valid_labels = valid['SalePrice']
valid_features = clean_features(valid_features)

In [None]:
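# Keep the History object separate from the `data` DataFrame loaded earlier.
history = model.fit(
    features, labels,
    epochs=1168,
    steps_per_epoch=20,
    batch_size=58,
    validation_data=(valid_features, valid_labels))

In [None]:
plt.plot(history.history['loss'], label='Training loss')
plt.plot(history.history['val_loss'], label='Validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Loss vs Epochs')
plt.legend()
plt.show()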

## Submission

In [None]:
test = pd.read_csv(f'{data_dir}/test.csv')
test.drop(columns=['Id', 'Utilities', 'LotShape', 'MSSubClass', 'LandContour', 'LotConfig', 'LandSlope'], inplace=True)
clean_test = clean_features(test)
clean_test.head()

In [None]:
submission = pd.read_csv(f'{data_dir}/sample_submission.csv')
# Predict all rows in one batched call instead of calling model.predict once per row.
submission['SalePrice'] = model.predict(clean_test.to_numpy()).ravel()
submission.head()

In [None]:
submission.to_csv(f'{data_dir}/submission.csv', index=False)

In [None]:
!kaggle competitions submit -c house-prices-advanced-regression-techniques -f {data_dir}/submission.csv -m "Third submission using tensorflow."