# Boston Housing Prices Predicition using CNN

This is a regression problem which consists of predicting a continuous
value instead instead of a discrete label.

In [1]:
from tensorflow import keras



### Loading our dataset

In [2]:
from keras.datasets import boston_housing
(train_data, train_targets), (test_data, test_targets) = boston_housing.load_data()

Using TensorFlow backend.


## Features
1. Per capita crime rate.
2. Proportion of residential land zoned for lots over 25,000 square feet
3. Proportion of non-retail business acres per town.
4. Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
5. Nitric oxides concentration (parts per 10 million).
6. Average number of rooms per dwelling.
7. Proportion of owner-occupied units built prior to 1940.
8. Weighted distances to five Boston employment centres.
9. Index of accessibility to radial highways.
10. Full-value property-tax rate per 10,000 dollars
11. Pupil-teacher ratio by town.
12. 1000 * (Bk - 0.63) ** 2 where Bk is the proportion of Black people by town.
13. % lower status of the population.

# Data Prep

As the ranges arent the same across all data it would make feeding into the neural network and getting a good result difficult. <br>
The normal practice for this is feature-wise normalization <br>
Normalization is a scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1. It is also known as Min-Max scaling. <br>
We will apply this technique to each feature as needed

- We will subtract the mean of the feature and divide by the standard deviation 

- Thus the feature will be centered around 0 and will have a unit standard deviation. 

### Normalizing the data

In [3]:
mean = train_data.mean(axis=0)
train_data -= mean
std = train_data.std(axis=0)
train_data /= std
test_data -= mean
test_data /= std

Note that the quantities that we use for normalizing the test data have been computed using the training data.

# Model

As we will need to instantiate the same model multiple times we are going to make the model as a function 

In [6]:

from keras import models
from keras import layers
def build_model():
    model = models.Sequential()
    model.add(layers.Dense(64, activation='relu',
                           input_shape=(train_data.shape[1],)))
    model.add(layers.Dense(64, activation='relu'))
    model.add(layers.Dense(1))
    model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])
    return model

### Loss Function (mse = Mean Squared Error)
In MSE, we calculate the square of our error and then take it’s mean. This is a quadratic scoring method, meaning, the penalty is proportional to not the error (like in MAE), but to the square of the error, which gives relatively higher weight (penalty) to large errors/outliers, while smoothening the gradient for smaller errors.<br>
- **Advantages**:<br>
    1. For small errors, MSE helps converge to the minima efficiently, as the gradient reduces gradually.
- **Disadvantages**:<br>
    1. Squaring the values does increases the rate of training, but at the same time, an extremely large loss may lead to a drastic jump during backpropagation, which is not desirable.
    2. MSE is also sensitive to outliers, i.e. outliers in data may impact our network more, as the loss for these will be considerably higher.
    
### Metric (mae = Mean Absolute Error)
MAE is the simplest error function, it literally just calculates the absolute difference (discards the sign) between the actual and predicted values and takes it’s mean.
- **Advantages:** <br>
    1. MAE is the simplest method to calculate the loss.
    2. Due to its simplicity, it is computationally inexpensive.
- **Disadvantages:** <br>
    1. MAE calculates loss by considering all the errors on the same scale. For example, if one of the output is on the scale of hundred while other is on the scale of thousand, our network won’t be able to distinguish between them just based on MAE, and so, it’s hard to alter weights during backpropagation.
    2. MAE is a linear scoring method, i.e. all the errors are weighted equally while calculating the mean. This means that while backpropagation, we may just jump past the minima due to MAE’s steep nature.
    
### Lack of Activation ? 
Our network ends with a single unit, and no activation (i.e. it will be linear layer). <br>
This is a typical setup for scalar regression (i.e. regression where we are trying to predict a single continuous value).<br>
Applying an activation function would constrain the range that the output can take.<br>
For instance if we applied a activation function to our sigmoid last layer, the network could only learn to predict values between 0 and 1. <br>
Because the last layer is purely linear, the network is free to learn to predict values in any range

# Validation 
due to our limited sample size we are going to implement cross validation 

### k-Fold Cross-Validation
Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample. <br>

The procedure has a single parameter called k that refers to the number of groups that a given data sample is to be split into. As such, the procedure is often called k-fold cross-validation. When a specific value for k is chosen, it may be used in place of k in the reference to the model, such as k=10 becoming 10-fold cross-validation.

Cross-validation is primarily used to estimate the skill of a machine learning model on unseen data. That is, to use a limited sample in order to estimate how the model is expected to perform in general when used to make predictions on data not used during the training of the model.

It is a popular method because it is simple to understand and because it generally results in a less biased or less optimistic estimate of the model skill than other methods, such as a simple train/test split.

#### Precedure Outline
1. shuffle our data set randomly (think deck of cards)
2. split the dataset into k groups
3. For each group: 
    - Take one group as our test data 
    - Take the remaining (K-1) groups as training data
    - Fit our model on training data and evaluate it on the test data
    - Save the evaluation and get rid of the model
4. We can see how well our model performs using the evaluation scores

#### Choosing a k value
So how do we choose what k will be?

Common Method for Choosing your k value


## K- fold Validation

In [19]:
import numpy as np
k = 4
num_val_samples = len(train_data) // k
num_epochs = 100
all_scores = []
all_mae_histories = []
for i in range(k):
    print('processing fold #', i)
    # Prepare the validation data: data from partition # k
    val_data = train_data[i * num_val_samples: (i + 1) * num_val_samples]
    val_targets = train_targets[i * num_val_samples: (i + 1) * num_val_samples]
    # Prepare the training data: data from all other partitions
    partial_train_data = np.concatenate([train_data[:i * num_val_samples],train_data[(i + 1) * num_val_samples:]],axis=0)
    partial_train_targets = np.concatenate([train_targets[:i * num_val_samples],train_targets[(i + 1) * num_val_samples:]],axis=0)
    # Build the Keras model (already compiled)
    model= build_model()
    # Train the model (in silent mode, verbose=0)
    model.fit(partial_train_data, partial_train_targets,epochs=num_epochs, batch_size=1, verbose=0)
    # Evaluate the model on the validation data
    # Train the model (in silent mode, verbose=0)
    history = model.fit(partial_train_data, partial_train_targets,validation_data=(val_data, val_targets),
                        epochs=num_epochs, batch_size=1, verbose=1)
    mae_history = history.history['val_mae']
    all_mae_histories.append(mae_history)
    

processing fold # 0
Train on 303 samples, validate on 101 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
processing fold # 1
Train on 303 samples, validate on 101 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
processing fold # 2
Train on 303 samples, validate on 101 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
processing fold # 3
Train on 303 samples, validate on 101 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


In [20]:
all_mae_histories

[[2.433706045150757, 2.343496084213257, 2.3067471981048584],
 [2.7326672077178955, 2.8479158878326416, 2.64471435546875],
 [2.6596012115478516, 2.9044253826141357, 2.7074146270751953],
 [3.410062551498413, 3.504591464996338, 3.06933856010437]]

In [23]:
average_mae_history = [np.mean([x[i] for x in all_mae_histories]) for i in range(num_epochs)]
average_mae_history


[2.8090092539787292, 2.900107204914093, 2.6820536851882935]

In [24]:
# Get a fresh, compiled model.
model = build_model()
# Train it on the entirety of the data.
model.fit(train_data, train_targets,
 epochs=80, batch_size=16, verbose=0)
test_mse_score, test_mae_score = model.evaluate(test_data, test_targets)



In [25]:
test_mae_score

2.802682399749756

Prediction is still off but more accurate

# Conclusion and takeways

- We have a varity of loss functions to use in regression which are different to the ones used for classification 
- When features in the input data have values in different ranges, each feature should be scaled independently as a preprocessing step.
- When there is little data available, using K-Fold validation is a great way to reliably evaluate a model.
- When little training data is available, it is preferable to use a small network with very few hidden layers (typically only one or two), in order to avoid severe overfitting

### Loss Functions for Regression 

- #### Mean Squared Error Loss L2
    - The Mean Squared Error, or MSE, loss is the default loss to use for regression problems.
    - It is the loss function to be evaluated first and only changed if you have a good reason.
    - Mean squared error is calculated as the average of the squared differences between the predicted and actual values. The result is always positive regardless of the sign of the predicted and actual values and a perfect value is 0.0. The squaring means that larger mistakes result in more error than smaller mistakes, meaning that the model is punished for making larger mistakes.
    - The mean squared error loss function can be used in Keras by specifying ‘mse‘ or ‘mean_squared_error‘ as the loss function when compiling the model.
    - It is recommended that the output layer has one node for the target variable and the linear activation function is used.
    - model.add(Dense(1, activation='linear'))

- #### Mean Squared Logarithmic Error Loss
    - There may be regression problems in which the target value has a spread of values and when predicting a large value, you may not want to punish a model as heavily as mean squared error.
    - Instead, you can first calculate the natural logarithm of each of the predicted values, then calculate the mean squared error. This is called the Mean Squared Logarithmic Error loss, or MSLE for short.
    - It has the effect of relaxing the punishing effect of large differences in large predicted values.
    - As a loss measure, it may be more appropriate when the model is predicting unscaled quantities directly.
    - The model can be updated to use the ‘mean_squared_logarithmic_error‘ loss function and keep the same configuration for the output layer.

- #### Mean Absolute Error Loss L1
    - On some regression problems, the distribution of the target variable may be mostly Gaussian, but may have outliers, e.g. large or small values far from the mean value.
    - The Mean Absolute Error, or MAE, loss is an appropriate loss function in this case as it is more robust to outliers. It is calculated as the average of the absolute difference between the actual and predicted values.
    - The model can be updated to use the ‘mean_absolute_error‘ loss function and keep the same configuration for the output layer.

    #### Huber Loss
    A comparison between L1 and L2 loss yields the following results:
    1. L1 loss is more robust than its counterpart. <br>

    On taking a closer look at the formulas, one can observe that if the difference between the predicted and the actual value is high, L2 loss magnifies the effect when compared to L1. Since L2 succumbs to outliers, L1 loss function is the more robust loss function.

    2. L1 loss is less stable than L2 loss. <br>

    Since L1 loss deals with the difference in distances, a small horizontal change can lead to the regression line jumping a large amount. Such an effect taking place across multiple iterations would lead to a significant change in the slope between iterations.

    On the other hand, MSE ensures the regression line moves lightly for a small adjustment in the data point.

    Huber Loss combines the robustness of L1 with the stability of L2, essentially the best of L1 and L2 losses. For huge errors, it is linear and for small errors, it is quadratic in nature.


###  Evaluation Metrics for Regression

- #### M.A.E (Mean Absolute Error)
    - It is the simplest & very widely used evaluation technique. <br>
    - It is simply the mean of difference b/w actual & predicted values.  <br>
- #### M.S.E (Mean Squared Error)
    - Another evaluation technique is the Mean Squared Error.  <br>
    - It takes the average of the square of the error. <br>
    - Here, the error is the difference b/w actual & predicted values. <br>
- #### R.M.S.E (Root Mean Squared Error)
    - Root mean squared Error is another technique. <br>
    - It solves the problem in the above technique. <br>
    - It squares the error & then it takes the square root of the total average function. <br>
- #### R.M.S.L.E (Root Mean Squared Log Error)
    - In case of RMSLE, you take the log of the predictions and actual values.  <br>
- #### R — Squared
    - Now, we come to another technique called R — Squared, whose actual name is Relative Squared Error. <br>
    - This method helps us to calculate the relative error. This technique helps us to judge, which algorithm is better based on their mean squared errors. <br>
    - If x >1, this means that, the MSE of the numerator is greater than the MSE of the baseline model which in turn means that, the new model is worse than the baseline model. <br>
    - Higher is the R — Squared, better is the model. <br>


