In [1]:
%matplotlib inline
import pandas as pd
import sklearn
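# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this import needs an older version
from sklearn.datasets import load_boston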
from matplotlib import pyplot as plt

### Hello!

#### For this coding exercise, we're going to do some basic analysis using the Boston Housing dataset. This is a commonly used public dataset, so the goal is not necessarily to produce the most impressive insight, but to demonstrate your ability to manipulate data and explain the reasoning behind your decisions. With that in mind, please take the time to explain your code and motivation thoroughly.

If you usually use R, this dataset is also publicly available to R users, or you can find it here: https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html

In [2]:
# Let's start by loading the Boston dataset and checking out its description
boston = load_boston()

In [3]:
print(boston.DESCR)

Boston House Prices dataset

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
      

----------------------------------------------------------------------------------------------------------------

## Now that we have the data, let's make a pandas DataFrame from it

In [4]:
df = pd.DataFrame(boston.data)
df.columns = boston.feature_names
# The target variable is stored separately in the dataset, so let's add it to our DataFrame as MEDV
df['MEDV'] = boston.target

In [5]:
df.head()

,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33,36.2


## Question 1:


Before even calculating the mean value, let's start by examining our target variable (the MEDV variable). <br> <b>Plot the distribution of the MEDV variable in whichever form you feel represents the data well and, in your own words, describe what you see.</b> 

In [6]:
def plot_MEDV_DIST(bos_df):
    # TODO: write a function that takes in the data frame and produces a plot.
    # This function should use matplotlib to plot our data in the notebook.
    return None

Q1 RESPONSE:



In [None]:
plot_MEDV_DIST(df)

## Question 2a:


<b>Write a function that adds a new column ("MEDV_quint") to the dataframe, mapping each MEDV value to its quintile.</b> For example, the lowest 20% of MEDV values should all have MEDV_quint == 1, the next 20% (between the 20th and 40th percentiles) should all have MEDV_quint == 2, and so on.

There are built-in functions that can assist you here, so use them if they make life easier. Also feel free to write your own helper functions. 

* How might we use this variable in the future to learn about our data? 
* When might this variable misrepresent our data if we're predicting this label value?
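
As one possible starting point (a sketch, not the expected answer), pandas' `qcut` can bin a column by quantiles. The helper name `add_quintile_column` and the toy values below are made up for illustration; they are not taken from the Boston data.

```python
import pandas as pd


def add_quintile_column(frame, value_col="MEDV", out_col="MEDV_quint"):
    """Return a copy of `frame` with a 1-5 quintile label for `value_col`."""
    out = frame.copy()
    # qcut cuts at the 20th/40th/60th/80th percentiles; labels=False gives 0-4,
    # so we add 1 to get the 1-5 labelling described in the question.
    out[out_col] = pd.qcut(out[value_col], q=5, labels=False) + 1
    return out


# Hypothetical toy values, just to show the mapping
toy = pd.DataFrame({"MEDV": [10.0, 12.5, 15.0, 17.5, 20.0, 22.5, 25.0, 30.0, 40.0, 50.0]})
toy = add_quintile_column(toy)
print(toy["MEDV_quint"].tolist())  # [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
```

If many rows share the same value, the quantile edges can coincide and `qcut` raises an error unless you pass `duplicates="drop"`, which then yields fewer than five bins.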

In [None]:
def add_MEDV_quintile(bos_df):
    # TODO: add the MEDV_quint column
    return bos_df

In [None]:
df = add_MEDV_quintile(df)

## Question 2b:


Let's say we just want to predict a binary variable rather than a five-level categorical variable.

<b>Find a better breakpoint (threshold_value) in our data that we will later use for binary discrimination.</b> That is, produce a new binary variable, bin_split, that is 0 if MEDV < threshold_value and 1 if MEDV >= threshold_value. Feel free to import libraries or <b>write your own helper function</b>, but please *make your code readable* and explain everything thoroughly.


* What is your threshold_value?
* How did you select that value?
* What are the benefits of your value over something like the mean or quintile values?
* What are some drawbacks of using your threshold_value?
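
One way to approach this, among many, is to pick the cut that makes the two resulting groups as internally homogeneous as possible, which is the idea behind Otsu's method and 1-D two-means clustering. The sketch below assumes that criterion; the function name `variance_split_threshold`, the toy values, and the choice to assign values equal to the threshold to class 1 are all illustrative assumptions, not part of the exercise.

```python
import numpy as np


def variance_split_threshold(values):
    """Return the cut point that minimizes the summed within-group variance."""
    v = np.sort(np.asarray(values, dtype=float))
    best_cut, best_cost = None, np.inf
    # Candidate cuts are the midpoints between consecutive distinct values
    for i in range(1, len(v)):
        if v[i] == v[i - 1]:
            continue
        low, high = v[:i], v[i:]
        # Size-weighted variance of each side: smaller means tighter groups
        cost = low.size * low.var() + high.size * high.var()
        if cost < best_cost:
            best_cut, best_cost = (v[i - 1] + v[i]) / 2.0, cost
    return best_cut


# Hypothetical toy values with an obvious gap between two clusters
toy_medv = [10.0, 11.0, 12.0, 13.0, 30.0, 31.0, 32.0]
threshold_value = variance_split_threshold(toy_medv)
bin_split = [int(x >= threshold_value) for x in toy_medv]
print(threshold_value, bin_split)  # 21.5 [0, 0, 0, 0, 1, 1, 1]
```

A data-driven cut like this can produce unbalanced classes, which is worth weighing against the simplicity of the mean or median.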

In [None]:
def produce_threshold_binary(bos_df):
    return bos_df

In [None]:
df = produce_threshold_binary(df)

## Question 3:

Now that we have a few different target variables, let's split the data into training and testing sets. Let's try to predict your new binary variable column (bin_split). This means that if we have feature values and target values, (X, y), then y is the new binary variable you created. 

Feel free to use this: https://scikit-learn.org/0.18/modules/generated/sklearn.model_selection.train_test_split.html, but please explain a few things:

* What's the idea behind randomly selecting our training and testing sets? Why would we use a random seed value?
* Why is the training set generally bigger than the test set?
* What if we also used a validation set? What does that mean, and how might it help us?
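
To make the role of the seed concrete, here is a minimal sketch of a seeded random split, assuming a NumPy random generator and the same return order as the stub below (X_train, y_train, X_test, y_test). The name `split_frame`, its parameters, and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd


def split_frame(frame, target_col, train_ratio, seed=0):
    """Shuffle rows with a fixed seed, then split into train and test parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(frame))  # same seed gives the same shuffle
    n_train = int(round(train_ratio * len(frame)))
    train = frame.iloc[order[:n_train]]
    test = frame.iloc[order[n_train:]]
    return (train.drop(columns=target_col), train[target_col],
            test.drop(columns=target_col), test[target_col])


# Hypothetical toy frame standing in for the real data
toy = pd.DataFrame({"RM": [5.9, 6.1, 6.4, 6.6, 7.0, 5.5, 6.8, 6.2, 7.2, 5.8],
                    "bin_split": [0, 0, 1, 1, 1, 0, 1, 0, 1, 0]})
X_train, y_train, X_test, y_test = split_frame(toy, "bin_split", train_ratio=0.8, seed=42)
print(len(X_train), len(X_test))  # 8 2
```

Because bin_split is derived from MEDV, you would presumably also want to drop MEDV and MEDV_quint from the features before modeling, otherwise the target leaks into X.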

In [7]:
def train_test_split(bos_df, train_ratio):
    # TODO: split our dataframe into a training set and a test set with len(X_train) / len(bos_df) == train_ratio
    return X_train, y_train, X_test, y_test

In [None]:
train_ratio = #INPUT VALUE
X_train, y_train, X_test, y_test = train_test_split(df, train_ratio)

## Question 4: 

Now that our data is split, let's work on predicting the value. I'm going to leave this section open to you. I expect some tweaking of the models and the variables, but nothing too extensive. All of your exploration should be driven by your insights *from the data*. I don't expect you to come up with a novel algorithm for finding the exact partition, but I'd like to hear your thoughts as you analyze this data. Please write clear, functional, modular code; we care about your ability to write code that can be used in larger libraries as we build internal tools.

* Before even training and testing the model, which variables do you think will be important?
* Which model did you select to predict the binary variable? Why did you pick this model?
* How did your model do? Explain your model using a confusion matrix (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)
* <b>If you could request more data during the data collection process of this dataset, what variables would you want to collect? Why?</b>
* What did you not expect? What did you learn?
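
For reading a confusion matrix, a small hand-rolled sketch may help. It assumes the same layout that scikit-learn's `confusion_matrix` uses (rows are true labels, columns are predicted labels); the function name `binary_confusion` and the label vectors are made up for illustration and are not model output.

```python
import numpy as np


def binary_confusion(y_true, y_pred):
    """2x2 counts: rows are true labels (0, 1), columns are predicted labels (0, 1)."""
    matrix = np.zeros((2, 2), dtype=int)
    for t, p in zip(y_true, y_pred):
        matrix[t, p] += 1
    return matrix


# Hypothetical labels, not results from a real model
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 1, 0, 1, 0]

cm = binary_confusion(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()  # row-major order: [[tn, fp], [fn, tp]]
accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(cm.tolist(), accuracy)  # [[3, 1], [1, 3]] 0.75
```

Looking at the off-diagonal cells separately shows whether the model errs more on the low-value or the high-value side, which a single accuracy number hides.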

In [54]:
# OPEN EXPLORATION