### Predict House Prices with KNN and Linear Regression

In this notebook I apply a simple linear regression and an algorithm based on K-Nearest Neighbours (KNN) to predict house prices.

This notebook uses common data science and machine learning libraries, such as pandas and scikit-learn.

The dataset, from Kaggle, can be found here:
https://www.kaggle.com/harlfoxem/housesalesprediction



In [17]:
"""
Created on Thu Feb  1 16:50:34 2018

@author: André Miranda
"""

import pandas as pd
from IPython.display import display

data = pd.read_csv('./kc_house_data.csv', encoding="ISO-8859-1", sep=",")

# Sensing the data:
print('Fields: ', data.columns.values,'\n')
display(data.head())
print(data.describe())

Fields:  ['id' 'date' 'price' 'bedrooms' 'bathrooms' 'sqft_living' 'sqft_lot'
 'floors' 'waterfront' 'view' 'condition' 'grade' 'sqft_above'
 'sqft_basement' 'yr_built' 'yr_renovated' 'zipcode' 'lat' 'long'
 'sqft_living15' 'sqft_lot15'] 



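|   | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|----|------|-------|----------|-----------|-------------|----------|--------|------------|------|-----|-------|------------|---------------|----------|--------------|---------|-----|------|---------------|------------|
| 0 | 7129300520 | 20141013T000000 | 221900.0 | 3 | 1.0 | 1180 | 5650 | 1.0 | 0 | 0 | ... | 7 | 1180 | 0 | 1955 | 0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 |
| 1 | 6414100192 | 20141209T000000 | 538000.0 | 3 | 2.25 | 2570 | 7242 | 2.0 | 0 | 0 | ... | 7 | 2170 | 400 | 1951 | 1991 | 98125 | 47.721 | -122.319 | 1690 | 7639 |
| 2 | 5631500400 | 20150225T000000 | 180000.0 | 2 | 1.0 | 770 | 10000 | 1.0 | 0 | 0 | ... | 6 | 770 | 0 | 1933 | 0 | 98028 | 47.7379 | -122.233 | 2720 | 8062 |
| 3 | 2487200875 | 20141209T000000 | 604000.0 | 4 | 3.0 | 1960 | 5000 | 1.0 | 0 | 0 | ... | 7 | 1050 | 910 | 1965 | 0 | 98136 | 47.5208 | -122.393 | 1360 | 5000 |
| 4 | 1954400510 | 20150218T000000 | 510000.0 | 3 | 2.0 | 1680 | 8080 | 1.0 | 0 | 0 | ... | 8 | 1680 | 0 | 1987 | 0 | 98074 | 47.6168 | -122.045 | 1800 | 7503 |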


                 id         price      bedrooms     bathrooms   sqft_living  \
count  2.161300e+04  2.161300e+04  21613.000000  21613.000000  21613.000000   
mean   4.580302e+09  5.400881e+05      3.370842      2.114757   2079.899736   
std    2.876566e+09  3.671272e+05      0.930062      0.770163    918.440897   
min    1.000102e+06  7.500000e+04      0.000000      0.000000    290.000000   
25%    2.123049e+09  3.219500e+05      3.000000      1.750000   1427.000000   
50%    3.904930e+09  4.500000e+05      3.000000      2.250000   1910.000000   
75%    7.308900e+09  6.450000e+05      4.000000      2.500000   2550.000000   
max    9.900000e+09  7.700000e+06     33.000000      8.000000  13540.000000   

           sqft_lot        floors    waterfront          view     condition  \
count  2.161300e+04  21613.000000  21613.000000  21613.000000  21613.000000   
mean   1.510697e+04      1.494309      0.007542      0.234303      3.409430   
std    4.142051e+04      0.539989      0.086517    

In [18]:
# Cleaning Data
# Removed the first two columns: id (a record identifier) and date (only two years are present)
data.drop(data.columns[:2], axis= 1, inplace = True)
# Removed lat and long, we have geo info in the zip code
data.drop(['lat','long'], axis = 1, inplace = True)
# Removed yr_renovated, because 96% are 0 (non renovated)
data.drop('yr_renovated', axis = 1, inplace = True)

print('Different zipcodes: ',len(data['zipcode'].value_counts(sort= True)))
print('Zipcode with less houses associated: ',min(data['zipcode'].value_counts(sort= True)),'\n')
# I'll keep zipcode and treat it as categorical, since it has few distinct values, but I'll try both options (categorical and numeric)

Different zipcodes:  70
Zipcode with less houses associated:  50 
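
The two options for zipcode mentioned above can be illustrated on a toy frame. This is only a sketch: the rows below are made-up values, not a sample of `kc_house_data.csv`, and the names `toy`, `as_number` and `as_category` are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "price": [221900.0, 538000.0, 180000.0, 604000.0],
    "zipcode": [98178, 98125, 98178, 98136],
})

# Option 1: keep zipcode as a plain number. A distance-based model then
# treats 98125 and 98136 as "close", which zip codes do not guarantee.
as_number = toy.copy()

# Option 2: one indicator column per distinct zipcode (one-hot encoding)
as_category = pd.get_dummies(toy, columns=["zipcode"])

print(as_number.columns.tolist())
print(as_category.columns.tolist())
```

With one-hot encoding, each row has exactly one active zipcode indicator, so two houses are either in the same zipcode or not, with no notion of numeric closeness between codes.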



In [19]:
# Encoding categorical feature
data = pd.get_dummies(data, columns  = ['zipcode'])

# Select the numeric variables to normalize; waterfront is excluded because it is binary
data_num = data[['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'view', 'condition', 'grade', 'sqft_above', 'sqft_basement', 'yr_built', 'sqft_living15', 'sqft_lot15']]
data_num_cols = list(data_num.columns.values)

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
# Fit the scaler and rescale every selected column to the [0, 1] range
normalized = scaler.fit_transform(data_num)
data_num_norm = pd.DataFrame(normalized, columns = data_num_cols)

# Merging the data together again, now normalized and with the categorical values handled.
data[data_num_cols] = data_num_norm

# Shuffle the dataset, to break any symmetry
data = data.sample(frac=1).reset_index(drop=True)

# Get train and test sets
from sklearn.model_selection import train_test_split

data_train, data_test = train_test_split(data, test_size = 0.2)

# Training Linear Model
from sklearn import linear_model

# Get column names 
data_colms = data_train.columns.values

# Initialize linear model
clf = linear_model.LinearRegression()
clf.fit(data_train[data_colms[1:]], data_train['price'])

from sklearn.metrics import mean_squared_error

test_predictions_linear = clf.predict(data_test[data_colms[1:]])
error_linear = mean_squared_error(data_test['price'], test_predictions_linear)
error_linear = error_linear ** 0.5
# The performance metric is the root mean squared error (RMSE)
print('Error for linear regression: ',round(error_linear),'\n')

# Training an algorithm based on K-NN
from sklearn.neighbors import NearestNeighbors

# Initialize KNN model
neighbors = 5
nbrs = NearestNeighbors(n_neighbors = neighbors).fit(data_train[data_colms[1:]])

# Get indexes of the KNN present in the train data, for each element of the test data
dist, idx = nbrs.kneighbors(data_test[data_colms[1:]])

# Calculate predicted prices for the test data as the average price of the KNN in the train data.
# idx holds positional row indexes into data_train, so it can index the price array directly.
data_test = data_test.copy()  # work on an explicit copy to avoid SettingWithCopyWarning
data_test['predicted_knn'] = data_train['price'].values[idx].mean(axis=1)
    
error_knn = mean_squared_error(data_test['price'], data_test['predicted_knn'])
error_knn = error_knn ** 0.5
print('Error for knn regression: ',round(error_knn),'\n') 

# To compare some samples
data_test['predicted_linear'] = test_predictions_linear 
print('To compare some samples:')
display(data_test[['price','predicted_linear','predicted_knn']].head())

Error for linear regression:  157650.0 





Error for knn regression:  161824.0 

To compare some samples:




|       | price     | predicted_linear | predicted_knn |
|-------|-----------|------------------|---------------|
| 12760 | 410000.0  | 407296.0         | 377800.0      |
| 10397 | 652450.0  | 715048.0         | 612800.0      |
| 19412 | 1300000.0 | 1181076.0        | 1076590.0     |
| 343   | 1890000.0 | 1690712.0        | 1770000.0     |
| 11487 | 502000.0  | 609640.0         | 533990.0      |
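
The manual averaging over nearest neighbours used above should match what scikit-learn's `KNeighborsRegressor` computes with its default uniform weights. The sketch below checks this on synthetic data; the arrays `X_train`, `y_train` and `X_test` are random placeholders, not the house data.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor, NearestNeighbors

# Synthetic placeholder data (not the King County dataset)
rng = np.random.default_rng(0)
X_train = rng.random((200, 3))
y_train = rng.random(200) * 500_000
X_test = rng.random((20, 3))

k = 5

# Manual approach: find the k nearest training rows, then average their targets
nbrs = NearestNeighbors(n_neighbors=k).fit(X_train)
_, idx = nbrs.kneighbors(X_test)
manual_pred = y_train[idx].mean(axis=1)

# Built-in regressor: uniform weights give a plain average of the k targets
knn = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
sklearn_pred = knn.predict(X_test)

print(np.allclose(manual_pred, sklearn_pred))
```

Using the built-in regressor would also give access to options such as `weights="distance"`, which weights closer neighbours more heavily.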


## Analysis:

We can tune the number of neighbors, the features used and some calculation parameters to achieve better results.
The errors of the two methods are very similar: in this run the RMSE is 157,650 for linear regression and 161,824 for KNN. Because the train and test sets are drawn at random, KNN sometimes comes out ahead and sometimes linear regression does. Either way, the KNN approach makes sense and produces results on par with linear regression.
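
One possible way to tune the number of neighbors is a cross-validated grid search. The following is a minimal sketch on synthetic data, not the procedure used in this notebook: the feature matrix `X`, the target `y`, and the grid values are all assumptions chosen for illustration. The scaler sits inside the pipeline so that each validation fold is scaled with statistics from its training folds only.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic placeholder data with features on different scales
rng = np.random.default_rng(0)
X = rng.random((300, 4)) * [5, 3000, 10, 100]
y = 100_000 + 150 * X[:, 1] + 20_000 * X[:, 0] + rng.normal(0, 10_000, 300)

# Scaling and KNN in one estimator, refit from scratch on every fold
pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("knn", KNeighborsRegressor()),
])

# Hypothetical grid: number of neighbors and weighting scheme
param_grid = {
    "knn__n_neighbors": [3, 5, 10, 20],
    "knn__weights": ["uniform", "distance"],
}

# scikit-learn maximizes scores, so RMSE is exposed as a negative value
search = GridSearchCV(pipe, param_grid,
                      scoring="neg_root_mean_squared_error", cv=5)
search.fit(X, y)

print(search.best_params_)
print(round(-search.best_score_))  # cross-validated RMSE of the best setting
```

Averaging the error over several folds also reduces the run-to-run variation that a single random train/test split shows.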