# House Price Prediction with Machine Learning

## Overview
### What You'll Learn
In this section, you'll learn:
1. How to use scikit-learn to create, train, and test a housing price predictor
2. How to use TensorFlow to create, train, and test a neural network

### Prerequisites
Before starting this section, you should have an understanding of:
1. [Basic Python](https://github.com/HackBinghamton/PythonWorkshop) (functions, loops, lists)
2. [scikit-learn](https://colab.research.google.com/github/HackBinghamton/MachineLearningWorkshopWeek1/blob/master/intro_ml_scikit.ipynb)
3. [TensorFlow](https://colab.research.google.com/github/HackBinghamton/MachineLearningWorkshopWeek1/blob/master/intro_neural_networks_tf.ipynb)

### Introduction
This section will give you the opportunity to apply what you've learned with scikit-learn and TensorFlow to a new dataset: the Boston housing price dataset.

---

## Prediction with scikit-learn

### 1. Loading the Data

The Boston housing price dataset is one of several datasets that have shipped with `sklearn`. It contains 506 samples from the Boston area, each with measurements of 13 attributes (e.g. per capita crime rate, property tax rate, pupil-teacher ratio). The 'target' (y) variable is the house price, recorded as a median home value in thousands of dollars. The goal is to train a model that finds a regression from the x-data to the y-data.

Accessing the data in the Boston house-price dataset is effectively the same as accessing the MNIST `digits` dataset.

In [None]:
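# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn release.
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

# boston.data holds the 13 attributes, boston.target holds the prices
boston = load_boston()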

xtrain, xtest, ytrain, ytest = train_test_split(boston.data,
                                                boston.target,
                                                test_size=0.5,    # Tweak to your liking
                                                random_state=42)  # Set random seed
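
To make the effect of `test_size` concrete, here is a small sketch that splits placeholder arrays with the same shape as the Boston data (506 rows, 13 columns). The zero-filled `X` and `y` are stand-ins chosen for illustration, not the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays shaped like the Boston data (506 samples, 13 attributes).
X = np.zeros((506, 13))
y = np.zeros(506)

# An even split, as in the cell above.
x_a, x_b, y_a, y_b = train_test_split(X, y, test_size=0.5, random_state=42)
print(x_a.shape, x_b.shape)

# A more common choice keeps most of the data for training.
x_c, x_d, y_c, y_d = train_test_split(X, y, test_size=0.2, random_state=42)
print(x_c.shape, x_d.shape)
```

A larger training share usually gives the model more to learn from, at the cost of a noisier test score.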

Let's see if we can come up with a model that will accurately predict the price of a house given its attributes...

### 2. Choosing a Model
`sklearn` comes with a variety of models that excel at different tasks. 

The most important distinction to make between models is whether they are *classifiers* or *regressions*. __Classifiers pick from a list of label options to predict what something is (e.g. apples, oranges), while regressions estimate a value on a continuous spectrum (e.g. a number on the Richter scale, *the price of a house*).__

 - __Classifiers *label* things, e.g.:__
    - This is a picture of a cat
    - This data seems to represent an orange
    - This sounds like this person's voice
 - __Regressions *estimate* things, e.g.:__
    - This was probably a 3.5 on the Richter scale
    - This stock will grow 5% by tomorrow
    - *This house probably cost $100k*

For this dataset, a *regression* is the right tool, since house prices are continuous values rather than labels.

`sklearn`'s `linear_model` family comes with numerous regressions to apply, like `LinearRegression`, `Lasso`, `Ridge`, and many more, which can be found at the [official documentation](https://scikit-learn.org/stable/supervised_learning.html).

### 3. Training and Using a Model

Thankfully, many of `sklearn`'s models share the same interface. To use one, replace `some_model_variety` below with the model family you'd like to load, and `ModelOfChoice` with the specific model you'd like to use.

```python
# Load up the model to use
from sklearn.some_model_variety import ModelOfChoice

# Load your data as shown above...

# Create your model
model = ModelOfChoice()

# Train your model
model.fit(xtrain, ytrain)

# Check its score on the test set (R^2 for regressions, accuracy for classifiers)
print("ModelOfChoice score:", model.score(xtest, ytest))
```

Using the template above, see what you can find! What models work the best for this dataset? What can you tweak to get better results?

In [None]:
# Import model (also try Lasso, Ridge)
from sklearn.linear_model import Ridge

# Create the model
model = Ridge()

# Train it
model.fit(xtrain, ytrain)

# Check the R^2 score on the test set (1.0 is a perfect fit)
print("R^2 score:", model.score(xtest, ytest))
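
For regressions, `score` returns R² (the coefficient of determination) rather than a classification accuracy. The sketch below compares three linear models side by side and recomputes R² by hand for one of them. It uses synthetic data from `make_regression` as a stand-in with the same shape as the Boston set, so the numbers it prints say nothing about the real dataset; the `alpha` values are arbitrary starting points.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 506 samples, 13 features (an assumption for illustration).
X, y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=42)
x_tr, x_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=42)

models = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=1.0),
}

# Fit each model and record its R^2 on the held-out half.
scores = {}
for name, model in models.items():
    model.fit(x_tr, y_tr)
    scores[name] = model.score(x_te, y_te)
    print(f"{name}: R^2 = {scores[name]:.3f}")

# R^2 = 1 - SS_res / SS_tot, recomputed by hand for the Ridge model.
ridge_preds = models["Ridge"].predict(x_te)
ss_res = np.sum((y_te - ridge_preds) ** 2)
ss_tot = np.sum((y_te - y_te.mean()) ** 2)
manual_r2 = 1 - ss_res / ss_tot
print(f"Manual R^2 for Ridge: {manual_r2:.3f}")
```

An R² of 1.0 means perfect predictions, while 0.0 means the model does no better than always predicting the mean price.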

Congrats! You just used ML to make a model to predict housing prices!

---

## Prediction with TensorFlow

### 1. Loading the Data

Thankfully, TensorFlow comes with its own version of the Boston housing dataset, which can be accessed similarly to the MNIST set.

In [None]:
import tensorflow as tf

boston = tf.keras.datasets.boston_housing

(xtrain, ytrain), (xtest, ytest) = boston.load_data()

### 2. Preparing the Data

Neural networks are sensitive to the scale of their inputs: the ranges and average values of different x-variables (e.g. crime rate vs. average rooms per house) affect how strongly each one influences training. For example, the nitric oxide concentration is a fraction below 1, while the property tax rate is a number in the hundreds. Because of this difference, a neural network may treat the property tax as hundreds of times more important than the nitric oxide concentration.

To take care of this, we must *normalize* our data, which means rescaling the variables so that they all sit on the same scale. One common way to do this is with *z-scores*: subtract each variable's mean and divide by its standard deviation, so that every variable ends up with a mean of 0 and a standard deviation of 1.

For each variable in our data, we'll perform the following operation:

```x = (x - avg) / stddev```

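Thankfully, the data arrives as NumPy arrays, which take away much of the busywork for us. Note that the test set is normalized with the training set's mean and standard deviation, so no information from the test data leaks into training: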

In [None]:
train_mean = xtrain.mean(axis=0)
train_stddev = xtrain.std(axis=0)
xtrain = (xtrain - train_mean) / train_stddev
xtest = (xtest - train_mean) / train_stddev
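
As a quick sanity check on the z-score formula, the sketch below normalizes two made-up columns on very different scales and confirms that each ends up with a mean of about 0 and a standard deviation of about 1. The values in `train` and `test` are hypothetical, not taken from the dataset.

```python
import numpy as np

# Two made-up columns on very different scales (hypothetical values).
train = np.array([[0.40, 200.0],
                  [0.50, 300.0],
                  [0.60, 400.0],
                  [0.70, 700.0]])
test = np.array([[0.55, 350.0]])

mean = train.mean(axis=0)  # per-column mean
std = train.std(axis=0)    # per-column standard deviation

train_z = (train - mean) / std
test_z = (test - mean) / std  # reuse the training statistics

print(train_z.mean(axis=0))  # approximately 0 for each column
print(train_z.std(axis=0))   # approximately 1 for each column
print(test_z)
```

After this step both columns vary over a similar range, so neither dominates simply because of its units.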

### 3. Building the Model

The neural network we construct here is remarkably similar to the one in the Intro TensorFlow section, with a couple of marked differences.

In [None]:
model = tf.keras.models.Sequential([
    # (No need to vectorize our input with Flatten -- this is already taken care of.)
    # (However, we still need to specify our input shape in the first layer...)
    
    # A Dense layer with relu, with the input shape tweaked to fit the 13 inputs of this dataset.
    tf.keras.layers.Dense(64, activation='relu', input_shape=(13,)),
    
    # A Dropout layer to prevent overfitting.
    tf.keras.layers.Dropout(0.02),
    
    # Since we only have one dimension in our output (housing price) we only need 1 in our final Dense layer.
    tf.keras.layers.Dense(1)
])
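
To make the layer shapes concrete, here is a rough NumPy sketch of what this network computes when predicting (dropout is inactive at prediction time, so it is left out). The weights are random placeholders rather than trained values, and the helper name `forward` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random placeholder parameters (a trained model would learn these).
W1 = rng.normal(size=(13, 64))  # first Dense layer: 13 inputs -> 64 units
b1 = np.zeros(64)
W2 = rng.normal(size=(64, 1))   # output Dense layer: 64 units -> 1 price
b2 = np.zeros(1)

def forward(x):
    """Dense(64, relu) followed by Dense(1) with no activation."""
    hidden = np.maximum(0.0, x @ W1 + b1)  # relu(xW + b)
    return hidden @ W2 + b2

batch = rng.normal(size=(5, 13))  # five made-up, already-normalized samples
print(forward(batch).shape)       # one prediction per row

# Count the trainable parameters across both layers.
n_params = W1.size + b1.size + W2.size + b2.size
print(n_params)
```

The final layer has no activation because a price can be any real number, unlike the class probabilities produced by a classifier.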

### 4. Compiling and Training the Model

We can compile our model just like before, but without the accuracy metric, since accuracy only makes sense for class labels and not for continuous values.

We also change the loss function from `sparse_categorical_crossentropy` to `mean_squared_error`, since `sparse_categorical_crossentropy` is for classification problems and this is a regression.

Finally, we train just like usual. Just be careful with the epochs to avoid overfitting!

In [None]:
model.compile(optimizer='adam', loss="mean_squared_error")

model.fit(xtrain, ytrain, epochs=60)
print("Training complete!")
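
To see what the `mean_squared_error` loss measures, it can be computed by hand on a few values. The prices below are hypothetical and only illustrate the arithmetic.

```python
import numpy as np

# Hypothetical prices in thousands of dollars.
actual = np.array([24.0, 21.6, 34.7, 33.4])
predicted = np.array([25.0, 20.6, 30.7, 35.4])

errors = predicted - actual
mse = np.mean(errors ** 2)  # average of the squared errors
rmse = np.sqrt(mse)         # back in the same units as the prices

print(f"MSE: {mse:.2f}  RMSE: {rmse:.2f}")
```

Because the errors are squared, one large miss (here the third house) contributes far more to the loss than several small ones.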

### 5. Testing the Model

Now that we've trained our model, let's test it out on some data! Run the code below to see the predicted and actual values of the test data, along with a per-sample accuracy (100% minus the percentage error).

In [None]:
preds = model.predict(xtest).flatten()

# Iterate through each test sample and pretty-print
print("Predict  Actual   Accuracy")
sum_acc = 0
for pred, actual in zip(preds, ytest):
    # "Accuracy" here is 100% minus the absolute percentage error
    acc = 100 - (100 * abs(pred - actual) / actual)
    sum_acc += acc
    print("{:5.2f}    {:5.2f}    {:5.2f}".format(pred, actual, acc))

print("Average Accuracy: {:.2f}%".format(sum_acc / len(ytest)))
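
The average accuracy printed above is 100% minus the mean absolute percentage error (MAPE). As a sketch, the same summary can be computed without a loop using NumPy, alongside the mean absolute error (MAE). The arrays below are hypothetical values, not real model output.

```python
import numpy as np

# Hypothetical predictions and actual prices (thousands of dollars).
preds = np.array([22.0, 18.5, 31.0, 14.0])
actual = np.array([20.0, 20.0, 30.0, 16.0])

abs_err = np.abs(preds - actual)
mae = abs_err.mean()                    # mean absolute error
mape = (abs_err / actual).mean() * 100  # mean absolute percentage error
avg_accuracy = 100 - mape               # same quantity as the loop's average

print(f"MAE: {mae:.2f}  MAPE: {mape:.2f}%  Average accuracy: {avg_accuracy:.2f}%")
```

MAE is often easier to interpret for prices, since it reads directly as "off by this many thousand dollars on average".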