### Predict House Prices with KNN and Linear Regression

On this notebook I apply a simple linear regression and an algorithm based on K-Nearest Neighbours to predict house prices.

This script uses some important libraries for DS and ML, such as Pandas and SkLearn.

The dataset from kaggle can be found here:
https://www.kaggle.com/harlfoxem/housesalesprediction



In [1]:
"""
Created on Thu Feb  1 16:50:34 2018

@author: André Miranda
"""

import pandas as pd

data = pd.read_csv('./kc_house_data.csv', encoding="ISO-8859-1", sep=",")

# Cleaning Data
# Removed date because only two years are present
data.drop(data.columns[:2], axis= 1, inplace = True)
# Removed lat and long, we have geo info in the zip code
data.drop(['lat','long'], axis = 1, inplace = True)
# Removed yr_renovated, since 96% are 0 (non renovated)
data.drop('yr_renovated', axis = 1, inplace = True)

print('Different zipcodes: ',len(data['zipcode'].value_counts(sort= True)))
print('Zipcode with less houses associated: ',min(data['zipcode'].value_counts(sort= True)),'\n')
# I'll keep zipcode and treat it as categorical, since there is not that much granularity, but I'll try both options

# Encoding categorical feature
data = pd.get_dummies(data, columns  = ['zipcode'])

# Select numeric variables to normalize, just excluded waterfront because it is binary
data_num = data[['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'view', 'condition', 'grade', 'sqft_above', 'sqft_basement', 'yr_built', 'sqft_living15', 'sqft_lot15']]
data_num_cols = list(data_num.columns.values)

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(data_num)
MinMaxScaler()
normalized = scaler.transform(data_num)
data_num_norm = pd.DataFrame(normalized, columns = data_num_cols)

# Merging the data together again, now normalized and with the categorical values handled.
data[data_num_cols] = data_num_norm

# Shuffle the dataset, to break any simetry
data = data.sample(frac=1).reset_index(drop=True)

# Get train and test sets
from sklearn.model_selection import train_test_split

data_train, data_test = train_test_split(data, test_size = 0.2)

# Training Linear Model
from sklearn import linear_model

# Get column names 
data_colms = data_train.columns.values

# Initialize linear model
clf = linear_model.LinearRegression()
clf.fit(data_train[data_colms[1:]], data_train['price'])

from sklearn.metrics import mean_squared_error

test_predictions_linear = clf.predict(data_test[data_colms[1:]])
error_linear = mean_squared_error(data_test['price'], test_predictions_linear)
error_linear = error_linear ** 0.5
# Used SQRT of the mean squared error as performance metric
print('Error for linear regression: ',round(error_linear),'\n')

# Training an algorithm based on K-NN
from sklearn.neighbors import NearestNeighbors

# Initialize KNN model
neighbors = 5
nbrs = NearestNeighbors(n_neighbors = neighbors).fit(data_train[data_colms[1:]])

# Get indexes of the KNN present in the train data, for each element of the test data
dist, idx = nbrs.kneighbors(data_test[data_colms[1:]])

# Calculate predicted prices for the test data as the average price of the KNN in the train data
data_test['predicted_knn'] = None
for i in range(data_test.shape[0]):   
    avg = 0
    for index in idx[i]:
        avg = avg + data_train['price'].iloc[index]
    avg = avg / neighbors
    data_test['predicted_knn'].set_value(i, avg, takeable = True)
    
error_knn = mean_squared_error(data_test['price'], data_test['predicted_knn'])
error_knn = error_knn ** 0.5
print('Error for knn regression: ',round(error_knn),'\n') 

# To compare some samples
data_test['predicted_linear'] = test_predictions_linear 
print('To compare some samples:')
print(data_test[['price','predicted_linear','predicted_knn']].head())


Different zipcodes:  70
Zipcode with less houses associated:  50 

Error for linear regression:  182902.0 



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


Error for knn regression:  198504.0 

To compare some samples:
          price  predicted_linear predicted_knn
6442   499950.0          661216.0        563780
3197   790000.0          794976.0        833380
2160   487000.0          595296.0        610960
13163  150000.0          221824.0        223500
21270  469500.0          414560.0        443450


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


## Analysis:

We can tune the number of neighbors and some calculation parameters to achieve better results.
The error for the two methods are very similar, due to the random sets sometimes knn are better sometimes linear regression, although it is clear than the knn approach makes sense and produces good results.