# Task 1: Introduction

---

For this project, we are going to predict the price of houses given the following features:

1. Year of sale of the house
2. The age of the house at the time of sale
3. Distance from city center
4. Number of stores in the locality
5. The latitude
6. The longitude

![Regression](images/regression.png)

Note: This notebook uses Python 3 and these packages: `tensorflow`, `pandas`, `matplotlib`, `scikit-learn`.

## 1.1: Importing Libraries & Helper Functions

First of all, we will need to import some libraries and helper functions. This includes TensorFlow and some utility functions that I've written to save time.

In [1]:
pip install tensorflow

Collecting tensorflow
  Downloading tensorflow-2.2.0-cp36-cp36m-manylinux2010_x86_64.whl (516.2 MB)
Collecting tensorboard<2.3.0,>=2.2.0
  Downloading tensorboard-2.2.2-py3-

In [13]:
pip install --ignore-installed --upgrade tensorflow-gpu

Collecting tensorflow-gpu
  Downloading tensorflow_gpu-2.2.0-cp36-cp36m-manylinux2010_x86_64.whl (516.2 MB)

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf

from utils import *
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping, LambdaCallback

%matplotlib inline
# TensorFlow 2.x has no top-level tf.logging; use the compat.v1 module instead
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)

print('Libraries imported.')

# Task 2: Importing the Data

## 2.1: Importing the Data

The dataset is saved in a `data.csv` file. We will use `pandas` to load it and take a look at the first few rows.

In [None]:
# column_names is not defined in this notebook; it is expected to come from the utils import
df = pd.read_csv('data.csv', names = column_names)
df.head()
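
As a small illustration of what the `names` argument does, the sketch below reads a headerless CSV from an in-memory string. The column names and the two data rows are made up for this example; they are not the contents of `data.csv` or of the real `column_names` list.

```python
import io
import pandas as pd

# Hypothetical column names and rows, for illustration only.
example_columns = ['serial', 'age', 'price']
csv_text = "0,12,14000\n1,30,9000\n"

# When names is given, pandas treats every line as data
# and uses the supplied labels as the header.
example_df = pd.read_csv(io.StringIO(csv_text), names=example_columns)

print(example_df.shape)
print(list(example_df.columns))
```

If the file already had a header row, passing `names` like this would turn that header into a data row, so it is worth checking the first row of `df.head()`.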

## 2.2: Check Missing Data

It is good practice to check whether the data has any missing values. In real-world data, missing values are quite common and must be handled before any data pre-processing or model training.

In [None]:
df.isna().sum()
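
If the check above did report missing values, two common ways to handle them are dropping the affected rows or filling the gaps with a column statistic. The sketch below shows both on a made-up toy DataFrame; which option is appropriate depends on the data, and neither is prescribed by this notebook.

```python
import numpy as np
import pandas as pd

# Toy data with two missing entries (values are made up).
toy = pd.DataFrame({
    'age': [10.0, np.nan, 25.0, 3.0],
    'stores': [4.0, 7.0, np.nan, 2.0],
    'price': [13000.0, 15000.0, 9000.0, 14500.0],
})

# Count missing values per column, as in the cell above.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Option 1: drop every row that has at least one missing value.
dropped = toy.dropna()

# Option 2: fill each gap with the mean of its column.
filled = toy.fillna(toy.mean())

print(len(dropped), int(filled.isna().sum().sum()))
```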

# Task 3: Data Normalization

## 3.1: Data Normalization

We can make it easier for optimization algorithms to find minima by normalizing the data before training a model. Here, each column is shifted to zero mean and scaled to unit standard deviation.

In [None]:
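# Drop the first column, which is neither one of the six features nor the label
df = df.iloc[:, 1:]
# Standardize every remaining column: subtract its mean, divide by its standard deviation
df_norm = (df - df.mean()) / df.std()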
df_norm.head()
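
To see what this normalization does, the sketch below applies the same formula to a small made-up DataFrame and checks that every column ends up with a mean of about 0 and a standard deviation of about 1. The column names and values are illustrative only.

```python
import pandas as pd

# Made-up values on very different scales.
toy = pd.DataFrame({
    'distance': [2.0, 4.0, 6.0, 8.0],
    'price': [10000.0, 12000.0, 14000.0, 16000.0],
})

# Subtract the column mean, then divide by the column standard deviation.
# pandas' std() uses the sample standard deviation (ddof=1) by default.
toy_norm = (toy - toy.mean()) / toy.std()

print(toy_norm.mean().round(6))
print(toy_norm.std().round(6))
```

After normalization, both columns are on the same scale even though the raw values differ by several orders of magnitude.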

## 3.2: Convert Label Value

Because we are using normalized values for the labels, a trained model will return its predictions on the same normalized scale. So, we need to convert the predicted values back to the original scale if we want predicted prices. This is the inverse of the normalization step: multiply by the standard deviation of the price column and add its mean.

In [None]:
y_mean = df['price'].mean()
y_std = df['price'].std()

def convert_label_value(pred):
    return int(pred * y_std + y_mean)

print(convert_label_value(0.350088))
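
The conversion can be sanity-checked with a round trip: normalizing a price and then converting it back should return the original value. The sketch below does this with the standard library on made-up prices; the helper names `normalize` and `denormalize` are hypothetical and only used here.

```python
import statistics

prices = [12000.0, 15000.0, 9000.0, 14000.0]  # made-up prices

mean = statistics.mean(prices)
std = statistics.stdev(prices)  # sample standard deviation, like pandas' default

def normalize(price):
    # Forward step: original scale -> normalized scale.
    return (price - mean) / std

def denormalize(value):
    # Inverse step: normalized scale -> original scale.
    return value * std + mean

round_trip = [denormalize(normalize(p)) for p in prices]
print(round_trip)
```

A normalized value of 0 maps back to the mean price, and a value of 1 maps to one standard deviation above it.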

# Task 4: Create Training and Test Sets

## 4.1: Select Features

Make sure to remove the column __price__ from the list of features as it is the label and should not be used as a feature.

In [None]:
X = df_norm.iloc[:, :6]
X.head()

## 4.2: Select Labels

In [None]:
Y = df_norm.iloc[:, -1]
Y.head()

## 4.3: Feature and Label Values

We need to extract the underlying NumPy arrays from the features and labels, since the TensorFlow model expects plain numeric arrays as input.

In [None]:
X_arr = X.values
Y_arr = Y.values

print('X_arr shape: ', X_arr.shape)
print('Y_arr shape: ', Y_arr.shape)

## 4.4: Train and Test Split

We will keep a small part of the data aside as a __test__ set. The model will not be trained on this set; we will use it only to check the performance of the model in its untrained and trained states. This way, we can make sure that we are going in the right direction with our model training.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_arr, Y_arr, test_size = 0.05, shuffle = True, random_state=0)

print('X_train shape: ', X_train.shape)
print('y_train shape: ', y_train.shape)
print('X_test shape: ', X_test.shape)
print('y_test shape: ', y_test.shape)
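
To make the effect of `test_size = 0.05` and `shuffle = True` concrete, here is a minimal NumPy sketch of a shuffled split on toy data. The function `shuffled_split` is a hypothetical helper written for this illustration; it is not scikit-learn's implementation and will not reproduce the same row order as `train_test_split`.

```python
import numpy as np

def shuffled_split(features, labels, test_fraction, seed=0):
    """Illustrative shuffled split; not scikit-learn's implementation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(features))  # random order of row indices
    n_test = int(np.ceil(test_fraction * len(features)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    # Index features and labels with the same indices so rows stay paired.
    return (features[train_idx], features[test_idx],
            labels[train_idx], labels[test_idx])

# Toy data: 100 rows, 2 features; row i is [2*i, 2*i + 1] with label i.
toy_X = np.arange(200).reshape(100, 2)
toy_y = np.arange(100)

Xtr, Xte, ytr, yte = shuffled_split(toy_X, toy_y, test_fraction=0.05)
print(Xtr.shape, Xte.shape)
```

With 100 rows and a test fraction of 0.05, 5 rows go to the test set and 95 stay for training, and every row appears in exactly one of the two sets.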

# Task 5: Create the Model

## 5.1: Create the Model

Let's write a function that returns an untrained model of a certain architecture.

In [None]:
def get_model():
    
    model = Sequential([
        Dense(10, input_shape = (6,), activation = 'relu'),
        Dense(20, activation = 'relu'),
        Dense(5, activation = 'relu'),
        Dense(1)
    ])

    model.compile(
        loss='mse',
        optimizer='adadelta'
    )
    
    return model

model = get_model()
model.summary()
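
As a cross-check on the output of `model.summary()`, the number of trainable parameters in a stack of `Dense` layers can be computed by hand: each layer has one weight per input-output pair plus one bias per unit. The sketch below does this for the layer widths used in `get_model`.

```python
# Input width followed by the number of units in each Dense layer.
layer_sizes = [6, 10, 20, 5, 1]

params_per_layer = [
    n_in * n_out + n_out  # weights plus one bias per unit
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]

print(params_per_layer)       # [70, 220, 105, 6]
print(sum(params_per_layer))  # 401
```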

# Task 6: Model Training

## 6.1: Model Training

We can use an `EarlyStopping` callback from Keras to stop the model training if the validation loss stops decreasing for a few epochs.

In [None]:
early_stopping = EarlyStopping(monitor='val_loss', patience = 5)

model = get_model()

preds_on_untrained = model.predict(X_test)

history = model.fit(
    X_train, y_train,
    validation_data = (X_test, y_test),
    epochs = 1000,
    callbacks = [early_stopping]
)
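
The patience rule can be sketched in plain Python. This is a simplified illustration of the idea behind `EarlyStopping(monitor='val_loss', patience = 5)`, not the Keras source: it ignores options such as `min_delta` and `restore_best_weights`, and the function name and loss values are made up.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return the 0-based epoch at which training would stop."""
    best = float('inf')
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # Strict improvement: remember it and reset the counter.
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    # Never ran out of patience: training runs through the last epoch.
    return len(val_losses) - 1

# Made-up validation losses: the best value (0.70) occurs at epoch 2.
losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.70, 0.73, 0.74, 0.69]
print(early_stopping_epoch(losses, patience=5))  # 7
```

In this example the loss at epoch 5 ties the best value but does not beat it, so it does not reset the counter, and training stops at epoch 7 before the later improvement at epoch 8 is ever seen.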

## 6.2: Plot Training and Validation Loss

Let's use the `plot_loss` helper function to take a look at the training and validation loss.

In [None]:
plot_loss(history)

# Task 7: Predictions

## 7.1: Plot Raw Predictions

Let's use the `compare_predictions` helper function to compare predictions from the model when it was untrained and when it was trained.

In [None]:
preds_on_trained = model.predict(X_test)

compare_predictions(preds_on_untrained, preds_on_trained, y_test)

## 7.2: Plot Price Predictions

The plot for price predictions will look the same as the plot for raw predictions, with just one difference: the scales of the x and y axes change from normalized values to prices.

In [None]:
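# model.predict returns arrays of shape (n, 1); flatten so each element is a scalar
price_on_untrained = [convert_label_value(y) for y in preds_on_untrained.flatten()]
price_on_trained = [convert_label_value(y) for y in preds_on_trained.flatten()]
price_y_test = [convert_label_value(y) for y in y_test]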

compare_predictions(price_on_untrained, price_on_trained, price_y_test)