# &#x1F4D1; &nbsp; <span style="color:#338DD4">  P1: Predicting Boston Housing Prices </span>

### 1. References
#### Dataset
*Origin:* This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University. 

*Creators:* Harrison, D. and Rubinfeld, D.L. 

*Data Set Information:* Concerns housing values in suburbs of Boston.

*Attribute Information:*

- CRIM: per capita crime rate by town 
- ZN: proportion of residential land zoned for lots over 25,000 sq.ft. 
- INDUS: proportion of non-retail business acres per town 
- CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise) 
- NOX: nitric oxides concentration (parts per 10 million) 
- RM: average number of rooms per dwelling 
- AGE: proportion of owner-occupied units built prior to 1940 
- DIS: weighted distances to five Boston employment centres 
- RAD: index of accessibility to radial highways 
- TAX: full-value property-tax rate per 10,000 USD
- PTRATIO: pupil-teacher ratio by town 
- B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town 
- LSTAT: % lower status of the population 
- MEDV: Median value of owner-occupied homes in 1000 USD

#### Resources

Housing Data Set: https://archive.ics.uci.edu/ml/datasets/Housing



### 2. Code Library

In [2]:
from IPython.core.display import HTML
hide_code = ''
HTML('''<script>code_show = true; 
function code_display() {
    if (code_show) {
        $('div.input').each(function(id) {
            if (id == 0 || $(this).html().indexOf('hide_code') > -1) {$(this).hide();}
        });
        $('div.output_prompt').css('opacity', 0);
    } else {
        $('div.input').each(function(id) {$(this).show();});
        $('div.output_prompt').css('opacity', 1);
    }
    code_show = !code_show;
} 
$(document).ready(code_display);</script>
<form action="javascript: code_display()"><input style="color: #338DD4; background: ghostwhite; opacity: 0.9; " \
type="submit" value="Click to display or hide code"></form>''')

In [3]:
hide_code
# Importing necessary libraries
import numpy as np
import matplotlib.pyplot as pl
from sklearn import datasets
from sklearn.tree import DecisionTreeRegressor
# Make matplotlib show our plots inline 
%matplotlib inline

### 3. Statistical Analysis and Data Exploration
#### 3.1 Data Loading

In [13]:
hide_code
# Create our client's feature set for which we will be predicting a selling price
client_features = [[11.95, 0.00, 18.100, 0, 0.6590, 5.6090, 90.00, 1.385, 24, 680.0, 20.20, 332.09, 12.13]]

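# Load the Boston Housing dataset into the city_data variable
# (load_boston was removed in scikit-learn 1.2, so this call needs an older release)
city_data = datasets.load_boston()

# Separate the target values (MEDV) from the 13 features
prices = city_data.target
features = city_data.data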

print ("Boston Housing dataset loaded successfully!")

Boston Housing dataset loaded successfully!


In this section of the project, we quickly investigate a few basic statistics about the dataset we are working with. In addition, we will look at the client's feature set in ***CLIENT_FEATURES*** and see how this particular sample relates to the features of the dataset. 

#### 3.2 Requested statistics

In [9]:
hide_code
# Number of houses in the dataset (rows of the feature matrix loaded in 3.1)
total_houses = features.shape[0]

# Number of features in the dataset (columns of the feature matrix)
total_features = features.shape[1]

# Minimum housing value in the dataset
minimum_price = np.amin(prices)

# Maximum housing value in the dataset
maximum_price = np.amax(prices)

# Mean house value of the dataset
mean_price = np.mean(prices)

# Median house value of the dataset
median_price = np.median(prices)

# Standard deviation of housing values of the dataset
std_dev = np.std(prices)

# Show the calculated statistics
print ("Boston Housing dataset statistics (in $1000's):\n")
print ("Total number of houses:", total_houses)
print ("Total number of features:", total_features)
print ("Minimum house price:", minimum_price)
print ("Maximum house price:", maximum_price)
print ("Mean house price: {0:.3f}".format(mean_price))
print ("Median house price:", median_price)
print ("Standard deviation of house price: {0:.3f}".format(std_dev))

Boston Housing dataset statistics (in $1000's):

Total number of houses: 506
Total number of features: 13
Minimum house price: 5.0
Maximum house price: 50.0
Mean house price: 22.533
Median house price: 21.2
Standard deviation of house price: 9.188


#### Question 1
Of the features available for each data point, choose three that you feel are significant and give a brief description for each of what they measure.

#### Answer 1

#### Question 2
Using your client's feature set CLIENT_FEATURES, which values correspond to the features you've chosen above?
Hint: Run the code block below to see the client's data.
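
The client's data can be displayed next to the attribute names with a short sketch like the one below. The `feature_names` list and the `client_by_name` dictionary are assumptions introduced for illustration: the names are typed out from the attribute list in Section 1 (same order, without the target MEDV) rather than read from the dataset object.

```python
# Attribute names, copied from the dataset description in Section 1
# (MEDV is the target, so it is omitted here)
feature_names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                 "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]

# The client's feature set, as defined in Section 3.1
client_features = [[11.95, 0.00, 18.100, 0, 0.6590, 5.6090, 90.00,
                    1.385, 24, 680.0, 20.20, 332.09, 12.13]]

# Pair each attribute name with the client's value for easier lookup
client_by_name = dict(zip(feature_names, client_features[0]))

for name, value in client_by_name.items():
    print("{0:>8}: {1}".format(name, value))
```

Looking up a single attribute is then a dictionary access, for example `client_by_name["RM"]`.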

#### Answer 2

### 4. Evaluating Model Performance

In this second section of the project, we will develop the tools necessary for a model to make a prediction. 

In the code block below, the ***shuffle_split_data*** function is implemented to perform the following steps.

- Randomly shuffle the input data X and target labels (housing values) y.
- Split the data into training and testing subsets, holding 30% of the data for testing.

In [14]:
hide_code
# Put any import statements you need for this code block here
# (sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it)
from sklearn.model_selection import train_test_split

def shuffle_split_data(X, y):
    """ Shuffles and splits data into 70% training and 30% testing subsets,
        then returns the training and testing subsets. """

    # Shuffle and split the data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 1)

    # Return the training and testing data subsets
    return X_train, y_train, X_test, y_test


# Test shuffle_split_data
try:
    X_train, y_train, X_test, y_test = shuffle_split_data(features, prices)
    print ("Successfully shuffled and split the data!")
except Exception:
    print ("Something went wrong with shuffling and splitting the data.")

Successfully shuffled and split the data!


#### Question 3
Why do we split the data into training and testing subsets for our model?

#### Question 4
Which performance metric below did you find most appropriate for predicting housing prices and analyzing the total error? Why?

- Accuracy
- Precision
- Recall
- F1 Score
- Mean Squared Error (MSE)
- Mean Absolute Error (MAE)
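
To make the last two entries of the list concrete, the sketch below computes MSE and MAE by hand with NumPy. The `y_true` and `y_pred` values are made up for illustration and are not taken from the dataset.

```python
import numpy as np

# Hypothetical true and predicted prices in $1000's
# (illustrative values, not from the dataset)
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

# Signed error for each prediction
errors = y_true - y_pred

# MSE squares each error, so large errors are penalized more heavily
mse = np.mean(errors ** 2)

# MAE averages the absolute errors, so each error counts in proportion to its size
mae = np.mean(np.abs(errors))

print("MSE: {0:.3f}".format(mse))
print("MAE: {0:.3f}".format(mae))
```

Both values are in the units of the target (MAE) or its square (MSE), which is one practical difference to keep in mind when comparing them.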

### 5. Fitting the Model

#### Question 5
What is the grid search algorithm and when is it applicable?

#### Question 6
What is cross-validation, and how is it performed on a model? Why would cross-validation be helpful when using grid search?
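
As a point of reference for Questions 5 and 6, here is a minimal sketch of grid search combined with k-fold cross-validation in scikit-learn. It is not the project's own code: the data is a synthetic stand-in generated with `make_regression`, and the grid of `max_depth` values, the 5 folds, and the scoring choice are all assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the housing data (assumption: 200 samples, 13 features)
X, y = make_regression(n_samples=200, n_features=13, noise=10.0, random_state=0)

# Candidate values for the hyperparameter being tuned
param_grid = {"max_depth": list(range(1, 11))}

# 5-fold cross-validation: each candidate is trained on 4 folds
# and scored on the remaining fold, rotating through all 5 folds
cv = KFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",  # scikit-learn maximizes scores, so MSE is negated
    cv=cv,
)
search.fit(X, y)

best_depth = search.best_params_["max_depth"]
print("Best max_depth:", best_depth)
print("Mean cross-validated MSE: {0:.3f}".format(-search.best_score_))
```

Each of the 10 candidate depths is evaluated 5 times here, so the search fits 50 trees before refitting the best candidate on all of the data.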

### 6. Checkpoint!

### 7. Analyzing Model Performance

#### Question 7
Choose one of the learning curve graphs that are created above. What is the max depth for the chosen model? As the size of the training set increases, what happens to the training error? What happens to the testing error?

#### Question 8
Look at the learning curve graphs for the model with a max depth of 1 and a max depth of 10. When the model is using the full training set, does it suffer from high bias or high variance when the max depth is 1? What about when the max depth is 10?
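
Learning curves of the kind discussed in Questions 7 and 8 can be approximated numerically with scikit-learn's `learning_curve` helper. The sketch below is an illustration under stated assumptions: it uses synthetic data from `make_regression` rather than the housing data, five training-set sizes, 5-fold cross-validation, and MSE as the error measure.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the housing data (assumption: 300 samples, 13 features)
X, y = make_regression(n_samples=300, n_features=13, noise=10.0, random_state=0)

results = {}
for depth in (1, 10):
    sizes, train_scores, test_scores = learning_curve(
        DecisionTreeRegressor(max_depth=depth, random_state=0),
        X, y,
        train_sizes=np.linspace(0.1, 1.0, 5),  # 10% to 100% of each training fold
        cv=5,
        scoring="neg_mean_squared_error",
    )
    # Scores are negated MSE: flip the sign and average over the folds
    train_err = -train_scores.mean(axis=1)
    test_err = -test_scores.mean(axis=1)
    results[depth] = (sizes, train_err, test_err)

    print("max_depth =", depth)
    for n, tr, te in zip(sizes, train_err, test_err):
        print("  n={0:3d}  train MSE={1:10.1f}  test MSE={2:10.1f}".format(int(n), tr, te))
```

Comparing the gap between the two error columns at the largest training size, for each depth, is one way to reason about bias and variance.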

#### Question 9
From the model complexity graph above, describe the training and testing errors as the max depth increases. Based on your interpretation of the graph, which max depth results in a model that best generalizes the dataset? Why?
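
A model complexity curve of this kind can be sketched by training one tree per depth and recording both errors. The example below is a hedged illustration, not the project's plotting code: it assumes synthetic data from `make_regression`, the same 70/30 split as `shuffle_split_data`, depths 1 through 10, and MSE as the error measure.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the housing data (assumption: 300 samples, 13 features)
X, y = make_regression(n_samples=300, n_features=13, noise=10.0, random_state=0)

# Hold out 30% of the data for testing, as in Section 4
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

depths = list(range(1, 11))
train_errors, test_errors = [], []

for depth in depths:
    # Fit one tree per depth and record its error on both subsets
    model = DecisionTreeRegressor(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    train_errors.append(mean_squared_error(y_train, model.predict(X_train)))
    test_errors.append(mean_squared_error(y_test, model.predict(X_test)))

for depth, tr, te in zip(depths, train_errors, test_errors):
    print("max_depth={0:2d}  train MSE={1:10.1f}  test MSE={2:10.1f}".format(depth, tr, te))
```

The two lists can be passed to `pl.plot(depths, train_errors)` and `pl.plot(depths, test_errors)` to draw the curves, assuming `matplotlib.pyplot` is imported as `pl` as in Section 2.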

### 8. Model Prediction

#### Question 10
Using grid search on the entire dataset, what is the optimal max_depth parameter for your model? How does this result compare to your initial intuition?

#### Question 11
With your parameter-tuned model, what is the best selling price for your client's home? How does this selling price compare to the basic statistics you calculated on the dataset?

#### Question 12
In a few sentences, discuss whether you would use this model or not to predict the selling price of future clients' homes in the Greater Boston area.

### 9. Conclusion