# Project: Classify House Prices
- Group houses into price classes and try to predict the class from Latitude and Longitude
- This shows whether location is a good indicator of the house unit price

### Step 1: Import libraries

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [29]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, r2_score

### Step 2: Read the data
- Use Pandas [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) method to read **files/house_prices.csv**

In [4]:
data = pd.read_csv('files/house_prices.csv')
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 414 entries, 0 to 413
Data columns (total 7 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Transaction                   414 non-null    float64
 1   House age                     414 non-null    float64
 2   Distance to MRT station       414 non-null    float64
 3   Number of convenience stores  414 non-null    int64  
 4   Latitude                      414 non-null    float64
 5   Longitude                     414 non-null    float64
 6   House unit price              414 non-null    float64
dtypes: float64(6), int64(1)
memory usage: 22.8 KB


### Step 3: Prepare data
- Create 15 bins of house prices
    - HINT: use [cut](https://pandas.pydata.org/docs/reference/api/pandas.cut.html) on the **'House unit price'** column with **bins=15** and assign the result to column **Class**.
    - Get the category codes by transforming column **Class** with **.cat.codes** and assign it to **Class id**

In [11]:
# Class column: each price is assigned to one of 15 equal-width intervals.
# e.g. 37.9 falls in the interval (36.907, 44.233], so 44 goes into the same category as 37.9.
data['Class'] = pd.cut(x=data['House unit price'],bins=15)
data.head()

| | Transaction | House age | Distance to MRT station | Number of convenience stores | Latitude | Longitude | House unit price | Class |
|---|---|---|---|---|---|---|---|---|
| 0 | 2012.917 | 32.0 | 84.87882 | 10 | 24.98298 | 121.54024 | 37.9 | (36.907, 44.233] |
| 1 | 2012.917 | 19.5 | 306.5947 | 9 | 24.98034 | 121.53951 | 42.2 | (36.907, 44.233] |
| 2 | 2013.583 | 13.3 | 561.9845 | 5 | 24.98746 | 121.54391 | 47.3 | (44.233, 51.56] |
| 3 | 2013.5 | 13.3 | 561.9845 | 5 | 24.98746 | 121.54391 | 54.8 | (51.56, 58.887] |
| 4 | 2012.833 | 5.0 | 390.5684 | 5 | 24.97937 | 121.54245 | 43.1 | (36.907, 44.233] |


In [12]:
# Convert data['Class'] to integer codes.
# .cat.codes gives the position of each row's category in the ordered list of categories.
# e.g. Class (36.907, 44.233] has Class id 4.
data['Class id'] = data['Class'].cat.codes
data.head()

| | Transaction | House age | Distance to MRT station | Number of convenience stores | Latitude | Longitude | House unit price | Class | Class id |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2012.917 | 32.0 | 84.87882 | 10 | 24.98298 | 121.54024 | 37.9 | (36.907, 44.233] | 4 |
| 1 | 2012.917 | 19.5 | 306.5947 | 9 | 24.98034 | 121.53951 | 42.2 | (36.907, 44.233] | 4 |
| 2 | 2013.583 | 13.3 | 561.9845 | 5 | 24.98746 | 121.54391 | 47.3 | (44.233, 51.56] | 5 |
| 3 | 2013.5 | 13.3 | 561.9845 | 5 | 24.98746 | 121.54391 | 54.8 | (51.56, 58.887] | 6 |
| 4 | 2012.833 | 5.0 | 390.5684 | 5 | 24.97937 | 121.54245 | 43.1 | (36.907, 44.233] | 4 |
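
To make the binning step concrete, here is a minimal sketch of `pd.cut` and `.cat.codes` on a tiny series. The prices are hypothetical values chosen for illustration and are not taken from **files/house_prices.csv**; only three bins are used so the result is easy to read.

```python
import pandas as pd

# Hypothetical prices, for illustration only.
prices = pd.Series([10.0, 12.0, 25.0, 40.0])

# bins=3 splits the observed range [10, 40] into three equal-width intervals.
classes = pd.cut(prices, bins=3)

# .cat.codes is the position of each row's interval among the ordered categories.
class_ids = classes.cat.codes

print(classes.cat.categories)                      # the three intervals
print(class_ids.tolist())                          # [0, 0, 1, 2]
print(classes.value_counts(sort=False).tolist())   # number of rows per interval
```

Running `value_counts` on the real **Class** column is a quick way to see how many houses fall in each of the 15 bins; equal-width bins can leave some classes with very few rows.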


### Step 4: Prepare training and test data
- Assign **X** to be all the data (it is needed in the final step)
- Assign **y** to be the **Class id** column.
- Use **train_test_split** with **test_size=0.15**

In [31]:
X = data.copy()
y = data['Class id']
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=42, test_size=.15)

In [35]:
#X_train.info()
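
As a rough check of what `test_size=.15` means for a dataset of this size, the sketch below splits placeholder arrays with the same number of rows as reported by `data.info()` (414). The arrays `X_demo` and `y_demo` are made-up stand-ins, not the house data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 414                                  # row count reported by data.info()
X_demo = np.arange(n_rows).reshape(-1, 1)     # placeholder features
y_demo = np.arange(n_rows) % 15               # placeholder labels

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, random_state=42, test_size=0.15
)

print(len(X_tr), len(X_va))   # 351 63
```

With only 63 validation rows, each single misclassification changes the accuracy by about 1.6 percentage points, so small differences between runs should be read with care.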

### Step 5: Train a $k$-Neighbours Classifier
- Create a **KNeighborsClassifier()** model
- Fit the model on **X_train[['Latitude', 'Longitude']]** and **y_train**
- Predict on **X_valid[['Latitude', 'Longitude']]** and assign the result to **y_pred**
- Calculate the accuracy score

In [33]:
clf = KNeighborsClassifier()
clf.fit(X_train[['Latitude', 'Longitude']], y_train)
y_pred = clf.predict(X_valid[['Latitude', 'Longitude']])
accuracy_score(y_true=y_valid,
              y_pred=y_pred)
# Accuracy counts only exact matches: a prediction in any wrong category scores zero for that row.
# So clf put about 44 percent of the validation rows in the right category.

0.4444444444444444
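
An accuracy figure is easier to judge next to a trivial baseline. The sketch below computes accuracy by hand and compares it with always predicting the most frequent training class. All label arrays here are hypothetical and only illustrate the calculation; they are not the notebook's **y_train**, **y_valid**, or **y_pred**.

```python
import numpy as np

# Hypothetical class ids, for illustration only.
y_train_demo = np.array([4, 4, 5, 6, 4, 3])
y_valid_demo = np.array([4, 5, 4, 3])
y_pred_demo = np.array([4, 5, 4, 5])

# Accuracy is the fraction of rows where the predicted class matches exactly.
accuracy = np.mean(y_valid_demo == y_pred_demo)

# Baseline: always predict the most frequent class seen in training.
values, counts = np.unique(y_train_demo, return_counts=True)
majority_class = values[np.argmax(counts)]
baseline_accuracy = np.mean(y_valid_demo == majority_class)

print(accuracy, baseline_accuracy)   # 0.75 0.5
```

Applying the same comparison to the real split would show how much of the 44 percent comes from location and how much from class imbalance.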

### Step 6: Make prediction of categories
- Convert **y_pred** to a DataFrame
    - HINT: **df_pred = pd.DataFrame(y_pred, columns=['Pred cat'])**
- Get the middle value of the prediction category.
    - HINT: **df_pred['Pred'] = df_pred['Pred cat'].apply(lambda x: X_valid['Class'].cat.categories[x].mid)**
- Calculate the **r2_score** of the predicted and real price **'House unit price'** of **X_valid**

In [39]:
# Map each predicted Class id to the middle price of its interval,
# then compare that price to the real unit price.
df_pred = pd.DataFrame(y_pred, columns=['Pred cat'])
#df_pred.head()

In [40]:
# Make the predicted value to be the middle value of the categories.
df_pred['Pred'] = df_pred['Pred cat'].apply(lambda x: X_valid['Class'].cat.categories[x].mid)
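
The same code-to-midpoint lookup can also be done in one indexing step, since the categories produced by `pd.cut` form an `IntervalIndex` with a `.mid` attribute. This is a sketch on hypothetical prices and hypothetical predicted codes, not the notebook's data.

```python
import numpy as np
import pandas as pd

# Hypothetical prices and predicted class ids, for illustration only.
prices = pd.Series([10.0, 12.0, 25.0, 40.0])
classes = pd.cut(prices, bins=3)
pred_codes = np.array([0, 2, 1])

# Midpoint of every interval, in category order.
mids = classes.cat.categories.mid.to_numpy()

# Index the midpoints with the predicted codes to get predicted prices.
pred_prices = mids[pred_codes]
print(pred_prices)
```

Because every prediction is snapped to a bin midpoint, the predicted price can never be closer to the true price than the position of that midpoint allows, which caps how good the regression score can get.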

In [41]:
# Calculate the R^2 score of the model on actual prices.
# Wrong: this compares class ids with class ids, not prices.
# r2_score(y_true=y_valid,
#         y_pred=df_pred['Pred cat'])
r2_score(y_true=X_valid['House unit price'],
        y_pred=df_pred['Pred']) # r squared score

0.7039083923865217
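
For reference, the coefficient of determination is $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$. The sketch below computes it by hand on hypothetical prices; `r2_manual` is an illustrative helper, not part of scikit-learn.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """Illustrative helper: R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # squared error of the predictions
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # squared error of predicting the mean
    return 1.0 - ss_res / ss_tot

# Hypothetical real prices and bin-midpoint predictions.
real = [30.0, 40.0, 50.0, 60.0]
pred_mid = [35.0, 35.0, 55.0, 55.0]

score = r2_manual(real, pred_mid)
print(score)   # 0.8 (up to floating-point rounding)
```

Unlike accuracy, this score gives partial credit: a prediction that lands in a neighbouring bin is still close in price, which is one reason the two numbers in this notebook differ so much.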

In [None]:
# Does the higher score mean this is better?
# No, this only predicts the price from
# Longitude and Latitude.

# This model will not work on arbitrary data in this local environment.

# In the next session: Reinforcement Learning