# Intro to Machine Learning - Supervised Learning

In the previous lab we explored the 2015 King County, Washington home sale data and prepared it for modeling.  Now we will use this data to test a series of algorithms, assess their performance, and select the best one.  Since we are trying to predict the sale price, this is a regression problem.  We will fit linear, nearest neighbors, and random forest models.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

First, let's read in the data using the pandas ``read_csv()`` function.

In [2]:
df = pd.read_csv('cleaned_data.csv')

Check that the data was read in correctly and looks the way you expect, using ``.head()`` and ``.describe()``.

In [3]:
df.head()

,Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,grade,lat,long,renovated
0,0,221900.0,0.333333,0.133333,0.123783,0.003108,0.0,0.0,0.545455,0.571498,0.217608,0.0
1,1,538000.0,0.333333,0.3,0.317107,0.004072,0.4,0.0,0.545455,0.908959,0.166113,1.0
2,2,180000.0,0.222222,0.133333,0.066759,0.005743,0.0,0.0,0.454545,0.936143,0.237542,0.0
3,3,604000.0,0.444444,0.4,0.232267,0.002714,0.0,0.0,0.545455,0.586939,0.104651,0.0
4,4,510000.0,0.333333,0.266667,0.193324,0.004579,0.0,0.0,0.636364,0.741354,0.393688,0.0


In [4]:
df.describe()

,Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,grade,lat,long,renovated
count,21075.0,21075.0,21075.0,21075.0,21075.0,21075.0,21075.0,21075.0,21075.0,21075.0,21075.0,21075.0
mean,10788.67331,500419.4,0.371862,0.277226,0.240681,0.008612,0.19405,0.003986,0.599098,0.647739,0.253657,0.039573
std,6234.638222,246685.6,0.099159,0.097497,0.114436,0.024196,0.215259,0.063008,0.099627,0.224796,0.117742,0.194958
min,0.0,75000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,5389.5,319950.0,0.333333,0.2,0.155772,0.002714,0.0,0.0,0.545455,0.498472,0.156977,0.0
50%,10770.0,445000.0,0.333333,0.3,0.222531,0.00426,0.0,0.0,0.545455,0.664951,0.239203,0.0
75%,16188.5,625000.0,0.444444,0.333333,0.305981,0.006018,0.4,0.0,0.636364,0.841242,0.328904,0.0
max,21612.0,1495000.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


Are all of the columns you want accounted for?  Are there any extras you should drop?

In [5]:
df = df.drop(columns=['Unnamed: 0'])
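Columns named ``Unnamed: ...`` usually appear when a DataFrame index was written to the CSV and then read back as data.  The sketch below shows one way to find and drop them by name rather than typing each one out.  The small ``demo`` frame is a made-up stand-in for ``df``, reusing a few values from the ``.head()`` output above.

```python
import pandas as pd

# Hypothetical stand-in for df: one leftover index column plus two real columns.
demo = pd.DataFrame({
    'Unnamed: 0': [0, 1, 2],
    'price': [221900.0, 538000.0, 180000.0],
    'bedrooms': [0.333333, 0.333333, 0.222222],
})

# Collect every column whose name starts with 'Unnamed', then drop them all at once.
extra = [c for c in demo.columns if c.startswith('Unnamed')]
demo = demo.drop(columns=extra)

print(extra)               # ['Unnamed: 0']
print(list(demo.columns))  # ['price', 'bedrooms']
```

You can also avoid the extra column at the source, either by writing the file with ``to_csv(..., index=False)`` or by reading it with ``read_csv(..., index_col=0)``.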

Let's separate our target column (``price``) from the features.  We'll call the target ``y`` and the features ``X``.

In [6]:
y = df['price']
X = df.drop(columns='price')

Let's now split our data into a training set used to develop the model and a test set, unseen by the model, against which we can measure performance.  ``train_test_split()`` is a scikit-learn function that makes this easy.  We set the fraction of data to withhold for testing with the ``test_size`` parameter.  A value of 0.2-0.33 is usually a good starting point, but once all of the models are built, come back and change it to see whether your results are sensitive to it.  To make sure the rows are in random order, shuffle them with ``shuffle=True``.  You can check the ordering of the data by plotting ``y``.

In [7]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)
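Because the split is shuffled at random, the metrics below will change slightly every time the cell is re-run.  If you want repeatable numbers while comparing models, one option is to fix the ``random_state`` parameter.  The sketch below uses a small synthetic array in place of the lab's ``X`` and ``y``; the names ``X_demo`` and ``y_demo`` and the seed 42 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lab's X and y: 100 rows, 3 features.
rng = np.random.default_rng(0)
X_demo = rng.random((100, 3))
y_demo = rng.random(100)

# The same random_state gives the same shuffle, and so the same split, every run.
split_a = train_test_split(X_demo, y_demo, test_size=0.2, shuffle=True, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, shuffle=True, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)            # (80, 3) (20, 3)
print(np.array_equal(X_te, split_b[1]))  # True
```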

With our split data, we can fit our models and calculate performance metrics.

Start with a linear regression model.

In [8]:
from sklearn.linear_model import LinearRegression

lin_model = LinearRegression()
lin_model.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Now, let's make predictions on the unseen test data.  We will compare these to ``y_test``.  Keep in mind that ``y_test`` is a pandas Series, so we will pass its underlying array using ``.values``.

In [9]:
predictions = lin_model.predict(X_test)

Let's calculate some metrics.  The two we are most interested in here are mean absolute error (MAE) and the coefficient of determination (R², printed as ``r2``).  Print out their values.

In [10]:
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score

print('MAE:  ', mean_absolute_error(y_test.values, predictions))
print('r2:  ', r2_score(y_test.values, predictions))

MAE:   104052.21632014275
r2:   0.6574965957874219
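To see what these two numbers measure, it can help to compute them by hand.  The sketch below uses four made-up prices and predictions (not values from the lab data) and the standard definitions: MAE is the average absolute error, and R² is one minus the ratio of the residual sum of squares to the total sum of squares around the mean.

```python
import numpy as np

# Made-up sale prices and predictions, for illustration only.
y_true = np.array([200_000.0, 300_000.0, 400_000.0, 500_000.0])
y_pred = np.array([210_000.0, 280_000.0, 430_000.0, 500_000.0])

# MAE: average size of the error, in the same units as the target (dollars).
mae = np.mean(np.abs(y_true - y_pred))

# R^2: fraction of the variance in y_true that the predictions account for.
ss_res = np.sum((y_true - y_pred) ** 2)         # squared error of the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of always predicting the mean
r2 = 1 - ss_res / ss_tot

print('MAE:  ', mae)           # 15000.0
print('r2:  ', round(r2, 3))   # 0.972
```

An R² of 1 means perfect predictions, 0 means the model does no better than always predicting the mean price, and negative values mean it does worse.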


Now let's fit the data with a k-nearest neighbors regressor.  Remember, this model simply finds the k nearest training points and predicts the average of their target values (a plain average by default; pass ``weights='distance'`` for a distance-weighted one).  We'll follow the same procedure of fitting the data, making an out-of-sample prediction, and then calculating the same metrics.  Change k using the ``n_neighbors`` parameter to see how that impacts your results.

In [11]:
from sklearn.neighbors import KNeighborsRegressor

nn_model = KNeighborsRegressor(n_neighbors=5)
nn_model.fit(X_train, y_train)

predictions = nn_model.predict(X_test)

print('MAE:  ', mean_absolute_error(y_test.values, predictions))
print('r2:  ', r2_score(y_test.values, predictions))

MAE:   70052.83449584816
r2:   0.8099227791900137
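The nearest neighbors idea is simple enough to write out directly.  The function below is a minimal sketch of it, assuming Euclidean distance and a plain (unweighted) average; it is not how scikit-learn implements ``KNeighborsRegressor`` internally, and ``knn_predict`` and the toy arrays are hypothetical names made up for this example.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=5):
    """Predict one value as the plain average of the k nearest training targets."""
    # Euclidean distance from the query point to every training row.
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    return y_train[nearest].mean()

# Toy data: one feature, five training points.
X_toy = np.array([[0.0], [0.1], [0.2], [0.9], [1.0]])
y_toy = np.array([100.0, 110.0, 120.0, 300.0, 310.0])

# The two nearest points to 0.15 are 0.1 and 0.2, so the prediction is their average.
pred = knn_predict(X_toy, y_toy, np.array([0.15]), k=2)
print(pred)  # 115.0
```

Because the prediction depends entirely on distances, the feature scaling done in the previous lab matters a great deal for this model.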


The last model we will fit is a random forest regressor.  Random forest models are very popular with practitioners because they do a great job of fitting the nuances of your data, especially in large datasets like ours.

In [13]:
from sklearn.ensemble import RandomForestRegressor

rf_model = RandomForestRegressor(n_estimators=100)
rf_model.fit(X_train, y_train)

predictions = rf_model.predict(X_test)

print('MAE:  ', mean_absolute_error(y_test.values, predictions))
print('r2:  ', r2_score(y_test.values, predictions))

MAE:   60569.87624408857
r2:   0.8568772244990241


As a final exercise, think a bit about our scoring metrics.  They each provide different insights into the performance of the model - what are they telling us and how can they be biased?  In particular, let's think about MAE and whether it is the best metric.  Some of these values are quite high when you consider that we found the median sale price to be about $445,000.  What might be a better metric?
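One candidate worth trying (not the only answer) is a relative error such as the mean absolute percentage error (MAPE), which expresses each miss as a fraction of the true price.  The sketch below uses made-up prices to show that the same dollar error means very different things for cheap and expensive homes.

```python
import numpy as np

# Made-up prices: every prediction is off by exactly $60,000.
y_true = np.array([200_000.0, 400_000.0, 1_000_000.0])
y_pred = np.array([260_000.0, 460_000.0, 1_060_000.0])

abs_err = np.abs(y_true - y_pred)
mae = abs_err.mean()              # identical dollar error on every home
mape = np.mean(abs_err / y_true)  # error as a fraction of each home's true price

print('MAE:  ', mae)                           # 60000.0
print('MAPE (%):  ', round(mape * 100, 1))     # 17.0
```

Here the MAE is $60,000 for every home, but that is a 30% miss on the $200,000 home and only a 6% miss on the $1,000,000 home.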

### Final Thoughts

With scikit-learn we've been able to get these complex models working on our data very easily.  This greatly improves our workflow, and allows us to spend less time getting the code running and more time actually doing data science.  If you've finished early, go back and "turn the knobs" of the algorithms to see if you can improve the accuracy of your models.  Especially with the random forest model, there are a lot of parameters you can tweak.
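As a starting point for turning the knobs, the sketch below loops over a few values of the random forest's ``max_depth`` parameter and records the MAE for each.  It runs on synthetic data from ``make_regression`` rather than the housing data, and the particular depths, ``n_estimators=50``, and seeds are arbitrary choices for illustration; swap in the lab's ``X_train``, ``X_test``, ``y_train``, and ``y_test`` to try it for real.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing features and prices.
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

results = {}
for max_depth in (2, 5, None):  # None places no limit on tree depth
    model = RandomForestRegressor(n_estimators=50, max_depth=max_depth, random_state=0)
    model.fit(X_tr, y_tr)
    results[max_depth] = mean_absolute_error(y_te, model.predict(X_te))

# Compare the held-out error for each setting.
for max_depth, mae in results.items():
    print('max_depth =', max_depth, ' MAE:', round(mae, 2))
```

Keep in mind that choosing parameters by their test-set score makes that score optimistic; a separate validation set or cross-validation is the usual remedy.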