# House Price Prediction using KNN (K-Nearest Neighbours) and Decision Tree Models

## Step 1:
### Q1: How many rows are there?

In [1]:
import pandas as pd

hPrice = pd.read_csv('House_Prices.csv')

print('Number of rows:', hPrice.shape[0])


Number of rows: 10659


### Q2: Looks like there are some extra columns with row markers that appeared over the cleaning process.  How many "actual" columns are there?


In [2]:
# shape[1] counts every column in the file, including any leftover row-marker columns
print('Number of columns:', hPrice.shape[1])

Number of columns: 13
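
The count above includes any leftover row-marker columns. Here is a minimal sketch of how they could be excluded. It assumes the markers carry names such as pandas' default `Unnamed: 0` or a `Record` ID column. Those names are hypothetical, so check `hPrice.columns` for the real ones. The small frame is made-up data standing in for the CSV.

```python
import pandas as pd

# Made-up stand-in for the CSV; the marker column names below are assumptions.
demo = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],       # hypothetical leftover index column
    "Record": [101, 102, 103],     # hypothetical record ID column
    "Sale_amount": [250000, 310000, 199000],
    "Beds": [3, 4, 2],
    "Town": [12, 47, 5],
})

# Select marker columns by name, then drop them to count the "actual" columns.
marker_cols = [c for c in demo.columns if c.startswith("Unnamed") or c == "Record"]
actual = demo.drop(columns=marker_cols)

print("Marker columns:", marker_cols)
print("Number of actual columns:", actual.shape[1])
```

Selecting by name keeps the step explicit. A heuristic such as "every value is unique" could wrongly flag real features in a small sample.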


### Q3: Why are the dates just numbers?  Why is this ok?
The dates are stored as numbers because numeric values are easy to compute with: two dates can be subtracted, sorted, or passed directly to a model.

This is fine because the numbers preserve the order and spacing of the dates, so we can still analyze the data for seasonal trends and peaks in when people buy houses.
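
Values such as 43201 look like spreadsheet serial dates, which count days from a fixed origin. The source does not say how the dates were encoded, so the sketch below assumes the Excel convention with an origin of 1899-12-30. Under that assumption, the numbers can be converted back to calendar dates when a readable form is needed.

```python
import pandas as pd

# Example serial values; 43201 is the sale date used later for Lee's house.
serial_dates = pd.Series([43201, 43101])

# Assumption: Excel-style serial days counted from 1899-12-30.
calendar_dates = pd.to_datetime(serial_dates, unit="D", origin="1899-12-30")

# The month can then be used to look for seasonal patterns.
months = calendar_dates.dt.month
print(calendar_dates.dt.date.tolist())
print(months.tolist())
```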

## Step 2:
### Q1: Create your X and Y for the model.  We are predicting the town, so including the closest university would make the test too easy, so exclude that column, and any record IDs.

In [3]:
# Create X and Y data sets
Y = hPrice['Town']

X = hPrice[['Sale_amount', 'Sale_date', 'Beds', 'Baths', 'Sqft_home', 'Sqft_lot', 'Type', 'Build_year']]


### Q2: Split into A Training and Test Set


In [4]:
from sklearn.model_selection import train_test_split

# split the data: 80% training and 20% testing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)


### Q3: Train the model

In [5]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# create a KNN model and use cross validation to evaluate it
knn = KNeighborsClassifier(n_neighbors=5)
knn_cv_score = cross_val_score(knn, X_train, Y_train, cv=5).mean()

# create a Decision Tree model and use cross validation to evaluate it
dt = DecisionTreeClassifier(max_depth=7)
dt_cv_score = cross_val_score(dt, X_train, Y_train, cv=5).mean()

# print the cross-validation scores
print("KNN cross-validation mean score:", knn_cv_score)
print("Decision Tree cross-validation mean score:", dt_cv_score)

KNN cross-validation mean score: 0.13486710695045606
Decision Tree cross-validation mean score: 0.26328177589532203


### Q4: Use the accuracy_score function to pick a good model

In [6]:
# use accuracy_score to pick the better model (that has a higher accuracy)
from sklearn.metrics import accuracy_score

# KNN Classifier
knn_1 = knn.fit(X_train, Y_train)
knn_train_acc = accuracy_score(Y_train, knn_1.predict(X_train))
knn_test_acc = accuracy_score(Y_test, knn_1.predict(X_test))
print("KNN Classifier Training Accuracy: ", knn_train_acc)
print("KNN Classifier Testing Accuracy: ", knn_test_acc)

# Decision Tree Classifier
dt_1 = dt.fit(X_train, Y_train)
dt_train_acc = accuracy_score(Y_train, dt_1.predict(X_train))
dt_test_acc = accuracy_score(Y_test, dt_1.predict(X_test))
print("Decision Tree Classifier Training Accuracy: ", dt_train_acc)
print("Decision Tree Classifier Testing Accuracy: ", dt_test_acc)



KNN Classifier Training Accuracy:  0.3667174856338689
KNN Classifier Testing Accuracy:  0.12148217636022514
Decision Tree Classifier Training Accuracy:  0.31101207927758884
Decision Tree Classifier Testing Accuracy:  0.25140712945590993


### Q5: Test out your one best model using your test set. 

In [7]:
# Pick the model with the highest testing accuracy
best_model = knn if knn_test_acc > dt_test_acc else dt

# Predict the target variable (Town) using the best model and the test set
Y_pred = best_model.predict(X_test)
print("The best model with the higher testing accuracy is:", best_model)

The best model with the higher testing accuracy is: DecisionTreeClassifier(max_depth=7)


Comparing the two models, the decision tree has the higher testing accuracy (about 0.25 versus 0.12 for KNN). However, this is not necessarily the best achievable model: I fixed `random_state=0` for the split and chose `max_depth=7` for the tree and `n_neighbors=5` for KNN without tuning them. Tuning `max_depth` would likely improve the tree's testing accuracy, although a deeper tree is more complex and takes longer to produce a result. Likewise, a tuned `n_neighbors` value would likely improve KNN's testing accuracy.
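
One way to search for better `max_depth` and `n_neighbors` values is a cross-validated grid search. The sketch below is illustrative only. It runs on synthetic data from `make_classification` rather than the house data, the grids are arbitrary choices, and the `StandardScaler` step is an addition that the original notebook did not use.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in (8 features, 4 classes); not the real house data.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=5, n_redundant=0,
    n_classes=4, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# KNN is distance-based, so features are standardised before the neighbour search.
knn_search = GridSearchCV(
    Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())]),
    param_grid={"knn__n_neighbors": [3, 5, 11, 21]},
    cv=5,
)
knn_search.fit(X_tr, y_tr)

# Search over tree depth; None lets the tree grow until its leaves are pure.
dt_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [3, 5, 7, 10, None]},
    cv=5,
)
dt_search.fit(X_tr, y_tr)

print("Best KNN params:", knn_search.best_params_, "CV score:", knn_search.best_score_)
print("Best tree params:", dt_search.best_params_, "CV score:", dt_search.best_score_)
```

Because the search uses only the training split, the test set stays untouched for the final comparison.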

### Q6: Present your findings.  Was your testing accuracy as good as your training?  What did you think happened?

In [10]:
#findings for the best model: Decision Tree Classifier

print("Decision Tree Classifier Training Accuracy:", dt_train_acc)
print("Decision Tree Classifier Testing Accuracy", dt_test_acc)

# Compare testing accuracy with training accuracy
# (use >= rather than ==: two floating-point accuracies are almost never exactly equal)
print("Is Testing Accuracy as good as Training Accuracy? ", dt_test_acc >= dt_train_acc)

# Explanation
if dt_test_acc >= dt_train_acc:
    print("The testing accuracy is as good as the training accuracy, indicating that the model is not overfitting and can generalize well to new data.")
else:
    print("The testing accuracy is not as good as the training accuracy, indicating that the model is overfitting and may not generalize well to new data.")


Decision Tree Classifier Training Accuracy: 0.31101207927758884
Decision Tree Classifier Testing Accuracy 0.25140712945590993
Is Testing Accuracy as good as Training Accuracy?  False
The testing accuracy is not as good as the training accuracy, indicating that the model is overfitting and may not generalize well to new data.


Although the Decision Tree model is more accurate than the KNN model, its training accuracy (about 0.31) is still higher than its testing accuracy (about 0.25). This gap indicates that the model is overfitting to the training data: it has learned some of the random fluctuations in the training set rather than only the underlying pattern. As a result, the model may not generalize well to new data.
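
To illustrate how the train/test gap relates to tree complexity, the sketch below fits trees of several depths and prints the gap for each. It uses noisy synthetic data, not the house data, and the depth values are arbitrary, so only the pattern matters, not the numbers.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with 20% label noise so that overfitting is visible.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_redundant=0,
    n_classes=4, flip_y=0.2, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

gaps = {}
for depth in [2, 4, 7, 12, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    train_acc = accuracy_score(y_tr, tree.predict(X_tr))
    test_acc = accuracy_score(y_te, tree.predict(X_te))
    # A larger positive gap means the tree fits the training data better than unseen data.
    gaps[depth] = train_acc - test_acc
    print(f"max_depth={depth}: train={train_acc:.3f}, test={test_acc:.3f}, gap={gaps[depth]:.3f}")
```

A gap that widens as `max_depth` grows is the usual sign of overfitting. Training accuracy keeps rising while testing accuracy stalls or falls.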

### Q7: What town do you think Lee's house is in? Use your model to predict.

In [11]:
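# new_home = [sale price, sale date, no. bedrooms, no. baths, sq. ft. of house, sq. ft. of lot,
#             type of house, year built]
new_home = [350000, 43201, 3, 2, 1450, 40000, 1, 1992]

# Wrap the values in a DataFrame with the training column names so the features line up with X
new_home_df = pd.DataFrame([new_home], columns=X.columns)
town = best_model.predict(new_home_df)
print("The predicted town for Lee's house is:", town[0])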

The predicted town for Lee's house is: 47


I used the best model (the decision tree classifier) to predict which town Lee's house is in, since it had a higher testing accuracy than KNN and should therefore give the more reliable prediction. Both models overfit the training data, and with a testing accuracy of only about 0.25 the prediction should be treated with caution.
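
Since a single predicted label hides how uncertain the model is, it can help to look at the class probabilities behind the prediction. The sketch below shows the idea with `predict_proba` on synthetic data. The data and the row standing in for Lee's house are made up, so the printed classes are not real towns.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in (8 features, 4 classes); not the real house data.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=5, n_redundant=0,
    n_classes=4, random_state=0,
)
tree = DecisionTreeClassifier(max_depth=7, random_state=0).fit(X_demo, y_demo)

# Hypothetical new observation: reuse the first row as a stand-in for Lee's house.
new_row = X_demo[:1]
proba = tree.predict_proba(new_row)[0]

# List the three most likely classes with their estimated probabilities.
top3 = np.argsort(proba)[::-1][:3]
for idx in top3:
    print(f"class {tree.classes_[idx]}: probability {proba[idx]:.2f}")
```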