## Part 1 - Initial data prep section.  Read, clean and create sets.

In [1]:
import pandas as pd

data_frame = pd.read_csv("Housing_data.gitignored/realtor-data.csv")
data_frame = data_frame.dropna(thresh=5)
filtered_data = data_frame[data_frame["bed"] < 10]
filtered_data = filtered_data[filtered_data["bath"] < 10]
filtered_data = filtered_data[filtered_data["acre_lot"] < 3]
filtered_data = filtered_data[filtered_data["house_size"] < 5000]
filtered_data = filtered_data[filtered_data["price"] < 2000000]
filtered_data = filtered_data[filtered_data["state"] == "Delaware"]

def get_name(value):
    if value <= 250000: return "Low"
    if value <= 500000: return "Mid"
    if value <= 2000000: return "High"

filtered_data['price_status'] = filtered_data['price'].map(get_name)
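# value_counts is referenced without parentheses here, so this prints the
# bound method's repr (which embeds the price Series), not the counts.
print(filtered_data["price"].value_counts)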

<bound method IndexOpsMixin.value_counts of 612232     68900.0
612240    199900.0
612256     85000.0
612284    150000.0
612290    121000.0
            ...   
685965    465500.0
685983    450000.0
686230    154900.0
688769    249900.0
688887    239900.0
Name: price, Length: 1650, dtype: float64>
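
The price bands from get_name can also be produced in one vectorized step. The sketch below assumes pd.cut with right-closed bins and uses a made-up toy frame (toy and its prices are hypothetical, not rows from realtor-data.csv); it also shows value_counts() being called to get per-class counts.

```python
import numpy as np
import pandas as pd

# Made-up prices for illustration only; chosen to sit on the band edges.
toy = pd.DataFrame({"price": [68900.0, 250000.0, 250001.0, 500000.0, 1999999.0]})

# Right-closed bins reproduce the <= comparisons in get_name:
# (-inf, 250000] -> Low, (250000, 500000] -> Mid, (500000, 2000000] -> High
toy["price_status"] = pd.cut(
    toy["price"],
    bins=[-np.inf, 250_000, 500_000, 2_000_000],
    labels=["Low", "Mid", "High"],
)

# Calling value_counts() returns how many rows fall in each band.
status_counts = toy["price_status"].value_counts()
print(status_counts)
```

Unlike get_name, pd.cut leaves values outside the outer edges as NaN instead of returning None, which only matters if the price filter is removed.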


In [2]:
import numpy as np
def fractional_split(filtered_data, test_fraction=0.2, seed=42):
    data_count = len(filtered_data)
    test_count = int(test_fraction*data_count)
    
    np.random.seed(seed)
    shuffled_indices = np.random.permutation(data_count)
    
    test_indices = shuffled_indices[:test_count]
    train_indices = shuffled_indices[test_count:]
    
    return filtered_data.iloc[train_indices], filtered_data.iloc[test_indices]

train_set, test_set = fractional_split(filtered_data)

The initial features for X are bath and house_size, and the target y is price_status, a three-way binning of the house price into Low, Mid and High. I binned the price so that the confusion matrix stays readable. I chose these two features because they are the most correlated with price, which is what we are ultimately trying to predict.
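
One way to check which columns track price most closely is to rank them by their Pearson correlation with price. This is a minimal sketch on a made-up four-row frame (toy and its values are hypothetical); on the real data the same two lines would be applied to filtered_data.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from realtor-data.csv.
toy = pd.DataFrame({
    "bath": [1, 1, 2, 3],
    "acre_lot": [0.5, 0.1, 0.4, 0.2],
    "house_size": [1000, 2000, 3000, 4000],
    "price": [100000, 200000, 300000, 400000],
})

# Keep numeric columns only (the real frame also has text columns such as state),
# then rank every feature by its correlation with price.
numeric = toy.select_dtypes("number")
ranked = numeric.corr()["price"].drop("price").sort_values(ascending=False)
print(ranked)
```

Correlation only captures linear relationships, so a low score does not rule a feature out for a tree-based model.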

## Part 3 - Do a decision tree on  X and y.  Compute metrics.

In [3]:
from sklearn.tree import DecisionTreeClassifier
train_data = train_set.copy()
X = train_data[['house_size', 'bath']]
y = train_data['price_status']

tree_classifier = DecisionTreeClassifier()
tree_classifier.fit(X,y)

In [4]:
from sklearn.metrics import confusion_matrix
y_predicted = tree_classifier.predict(X)
matrix = confusion_matrix(y, y_predicted)
print(matrix)

[[139  10  10]
 [  0 551  28]
 [ 12  82 488]]


In [5]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print ("Accuracy is ", accuracy_score(y, y_predicted))
print ("Precision is ", precision_score(y, y_predicted, average="weighted"))
print ("Sensitivity is ", recall_score(y, y_predicted, average="weighted"))
print ("F1 is ", f1_score(y, y_predicted, average="weighted"))

Accuracy is  0.8924242424242425
Precision is  0.8958149145753468
Sensitivity is  0.8924242424242425
F1 is  0.8919649592631513


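Honestly the confusion matrix doesn't look too bad. The rows are in sklearn's sorted label order (High, Low, Mid), so Low and Mid are close to evenly represented with 579 and 582 training rows, while High is much smaller at 159. An accuracy of 89% is pretty good, although it is measured on the same rows the tree was trained on.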
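
The weighted metrics can be recomputed by hand from the confusion matrix printed above, which also shows why Accuracy and Sensitivity come out identical. The sketch assumes the rows follow sklearn's sorted label order (High, Low, Mid); the variable names are only illustrative.

```python
import numpy as np

labels = ["High", "Low", "Mid"]  # assumed sorted label order
matrix = np.array([
    [139, 10, 10],
    [0, 551, 28],
    [12, 82, 488],
])

support = matrix.sum(axis=1)            # true rows per class
correct = matrix.diagonal()             # correctly predicted rows per class
recall = correct / support              # per-class sensitivity
precision = correct / matrix.sum(axis=0)

accuracy = matrix.trace() / matrix.sum()
# Weighting each class's recall by its support collapses to trace / total,
# so weighted recall always equals accuracy.
weighted_recall = (recall * support).sum() / support.sum()
weighted_precision = (precision * support).sum() / support.sum()

for name, r, p, s in zip(labels, recall, precision, support):
    print(f"{name}: recall={r:.3f} precision={p:.3f} support={s}")
print("accuracy", accuracy)
print("weighted recall", weighted_recall)
print("weighted precision", weighted_precision)
```

The per-class view is useful here because the smaller High class can be hidden by the two larger classes in a weighted average.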

## Part 5 - See if you can do better using SVM or some other multi-classifier.

In [6]:
from sklearn.svm import SVC
X = train_data[['house_size', 'bath']]
y = train_data['price_status']

svm_classifier = SVC(kernel = 'rbf')
svm_classifier.fit(X,y)

In [7]:
from sklearn.metrics import confusion_matrix
y_predicted = svm_classifier.predict(X)
matrix = confusion_matrix(y, y_predicted)
print(matrix)

[[ 42   8 109]
 [  4 455 120]
 [ 19  91 472]]


In [8]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print ("Accuracy is ", accuracy_score(y, y_predicted))
print ("Precision is ", precision_score(y, y_predicted, average="weighted"))
print ("Sensitivity is ", recall_score(y, y_predicted, average="weighted"))
print ("F1 is ", f1_score(y, y_predicted, average="weighted"))

Accuracy is  0.7340909090909091
Precision is  0.7349586496690299
Sensitivity is  0.7340909090909091
F1 is  0.7218833892124273


RBF is the best kernel I have tried so far, and it only reached 73% accuracy compared to the decision tree's 89%, so I'm going to use the decision tree for the test set.
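
One thing that may be holding the SVM back is feature scale: house_size is in the thousands while bath is a single digit, and an RBF kernel is sensitive to that. The sketch below shows the usual pattern of putting a StandardScaler in front of the SVC. It runs on made-up data (X_demo, y_demo and the size thresholds are hypothetical), so it says nothing about whether scaling would actually help on this dataset.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Made-up stand-in for the house_size and bath columns.
rng = np.random.default_rng(0)
n = 200
house_size = rng.uniform(500, 5000, size=n)
bath = rng.integers(1, 6, size=n)
X_demo = np.column_stack([house_size, bath])

# Hypothetical labels derived from size alone, just to have three classes.
y_demo = np.where(house_size < 1500, "Low",
                  np.where(house_size < 3000, "Mid", "High"))

# The scaler is fit inside the pipeline, so it only ever sees the rows
# passed to fit() and is reapplied automatically on predict().
scaled_svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
scaled_svm.fit(X_demo, y_demo)

demo_accuracy = scaled_svm.score(X_demo, y_demo)
print("Accuracy on the made-up data:", demo_accuracy)
```

With the notebook's variables, the same pipeline would be fit on train_data[['house_size', 'bath']] and train_data['price_status'].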

## Part 6 -  Do a final evaluation with the test set.

In [9]:
from sklearn.tree import DecisionTreeClassifier
test_data = test_set.copy()
X = test_data[['house_size', 'bath']]
y = test_data['price_status']

tree_classifier = DecisionTreeClassifier()
tree_classifier.fit(X,y)

In [10]:
from sklearn.metrics import confusion_matrix
y_predicted = tree_classifier.predict(X)
matrix = confusion_matrix(y, y_predicted)
print(matrix)

[[ 49   0   3]
 [  0 135   2]
 [  2  10 129]]


In [11]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print ("Accuracy is ", accuracy_score(y, y_predicted))
print ("Precision is ", precision_score(y, y_predicted, average="weighted"))
print ("Sensitivity is ", recall_score(y, y_predicted, average="weighted"))
print ("F1 is ", f1_score(y, y_predicted, average="weighted"))

Accuracy is  0.9484848484848485
Precision is  0.9492464073388054
Sensitivity is  0.9484848484848485
F1 is  0.9482714463179808


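The decision tree scored 89.2% on the training set and 94.8% on the test set, which is pretty good. Honestly, I didn't expect this dataset to perform this well given the initial exploration and the linear regression that I ran on it, but it turned out much better than I expected. One caveat: the tree in Part 6 was refit on the test set before it was scored, so the 94.8% measures how well a tree fits the test rows, not how the tree trained in Part 3 generalizes to them.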
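
A held-out evaluation normally fits the model once on the training rows and then only calls predict on the test rows. This is a minimal sketch of that pattern on made-up data (X_demo, y_demo and the thresholds are hypothetical, and train_test_split stands in for fractional_split), not a rerun of the notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Made-up stand-in for the house_size and bath columns.
rng = np.random.default_rng(0)
n = 200
house_size = rng.uniform(500, 5000, size=n)
bath = rng.integers(1, 6, size=n)
X_demo = np.column_stack([house_size, bath])
y_demo = np.where(house_size < 1500, "Low",
                  np.where(house_size < 3000, "Mid", "High"))

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Fit once, on the training rows only.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

# Score the same fitted tree on both splits; the test rows are never fit on.
train_accuracy = accuracy_score(y_train, tree.predict(X_train))
test_accuracy = accuracy_score(y_test, tree.predict(X_test))
print("Train accuracy:", train_accuracy)
print("Test accuracy:", test_accuracy)
```

In the notebook this would mean keeping the tree fit on train_data and calling predict on test_data[['house_size', 'bath']] without another fit. A gap between the two scores is then a direct sign of overfitting.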