In [1]:
import pandas as pd
import numpy as np

### Step 1: Import Data 

The "real_estate_valuation_data_set.csv" file contains historical market data for real estate valuation in a Chinese city. The variables are described below.

| Variable Name | Role | Type | Description | Units | 
|:--|:--|:--|:--|:--|
|No|ID|Integer|||
|X1 transaction date|Feature|Continuous|for example, 2013.250=2013 March, <br>2013.500=2013 June, etc.||
|X2 house age|Feature|Continuous||year|
|X3 distance to the <br>nearest metro station|Feature|Continuous||meter|
|X4 number of <br>convenience stores|Feature|Integer|number of convenience stores in the<br> living circle on foot|integer|
|X5 latitude|Feature|Continuous|geographic coordinate, latitude|degree|
|X6 longitude|Feature|Continuous|geographic coordinate, longitude|degree|
|Y house price of<br> unit area|Target|Continuous|10000 Chinese Yuan per <br>square metre|10000 CNY/<br>square metre|

In [2]:
# Read the data with pandas
real_estate = pd.read_csv("real_estate_valuation_data_set.csv", index_col=0)

In [3]:
# Show the first five rows of the data
real_estate.head()

No,X1 transaction date,X2 house age,X3 distance to the nearest MRT station,X4 number of convenience stores,X5 latitude,X6 longitude,Y house price of unit area
1,2012.917,32.0,84.87882,10,24.98298,121.54024,2.53
2,2012.917,19.5,306.5947,9,24.98034,121.53951,2.81
3,2013.583,13.3,561.9845,5,24.98746,121.54391,3.15
4,2013.5,13.3,561.9845,5,24.98746,121.54391,3.65
5,2012.833,5.0,390.5684,5,24.97937,121.54245,2.87


In [4]:
# Get the number of instances and features
row_num = real_estate.shape[0]
col_num = real_estate.shape[1]

In [5]:
row_num, col_num

(414, 7)

In [6]:
# Get the input variables X with features from X1 to X6.
X = np.array(real_estate.iloc[:,:6].values)

In [7]:
X.shape

(414, 6)

In [8]:
X

array([[2012.917  ,   32.     ,   84.87882,   10.     ,   24.98298,
         121.54024],
       [2012.917  ,   19.5    ,  306.5947 ,    9.     ,   24.98034,
         121.53951],
       [2013.583  ,   13.3    ,  561.9845 ,    5.     ,   24.98746,
         121.54391],
       ...,
       [2013.25   ,   18.8    ,  390.9696 ,    7.     ,   24.97923,
         121.53986],
       [2013.     ,    8.1    ,  104.8101 ,    5.     ,   24.96674,
         121.54067],
       [2013.5    ,    6.5    ,   90.45606,    9.     ,   24.97433,
         121.5431 ]], shape=(414, 6))

In [9]:
# Get the target values y
y = np.array(real_estate.iloc[:,6].values)

In [10]:
y.shape

(414,)

In [11]:
y

array([2.53, 2.81, 3.15, 3.65, 2.87, 2.14, 2.69, 3.11, 1.25, 1.47, 2.76,
       3.87, 2.62, 1.59, 2.29, 3.37, 4.67, 2.49, 2.82, 3.18, 1.95, 3.44,
       1.64, 3.19, 2.59, 1.8 , 3.75, 2.24, 3.13, 3.81, 1.47, 1.67, 2.28,
       3.29, 3.67, 1.82, 1.53, 1.69, 3.18, 3.08, 1.06, 1.21, 2.31, 2.27,
       3.59, 2.55, 2.8 , 4.1 , 0.89, 0.88, 2.95, 1.38, 1.8 , 2.59, 3.45,
       0.91, 2.79, 3.57, 1.51, 2.83, 1.42, 4.21, 1.85, 3.67, 1.69, 2.95,
       3.38, 3.79, 2.41, 2.8 , 3.93, 2.72, 2.42, 1.33, 3.63, 1.97, 2.45,
       1.71, 1.99, 1.77, 2.69, 2.45, 3.21, 1.18, 2.91, 3.39, 1.8 , 1.22,
       3.2 , 1.69, 3.03, 2.88, 1.45, 1.07, 2.73, 3.45, 3.97, 2.31, 3.4 ,
       4.15, 2.55, 2.19, 3.63, 3.05, 2.03, 4.73, 3.14, 1.77, 2.27, 1.89,
       3.44, 2.63, 1.54, 0.51, 3.55, 3.09, 0.81, 0.87, 2.04, 3.97, 2.09,
       3.2 , 2.17, 3.03, 3.83, 3.24, 4.19, 3.67, 4.05, 2.73, 2.5 , 2.05,
       2.5 , 2.63, 2.81, 1.39, 3.12, 3.16, 2.9 , 2.83, 3.43, 1.93, 2.5 ,
       2.67, 1.89, 3.03, 3.48, 2.88, 3.01, 2.65, 3.

We can transform the continuous target variable (house price) into a categorical variable (house price range) according to the following rule:

|price range|category|label|
|:--:|:--:|:--:|
|$y\leq 1.5$|very low|0|
|$1.5<y\leq2.5$|low|1|
|$2.5<y\leq3.5$|high|2|
|$y>3.5$|very high|3|

In [12]:
# Map each price to its range label (0 = very low, ..., 3 = very high)
y_ = np.zeros(row_num, dtype=int)
y_[y <= 1.5] = 0
y_[(y > 1.5) & (y <= 2.5)] = 1
y_[(y > 2.5) & (y <= 3.5)] = 2
y_[y > 3.5] = 3
y_

array([2, 2, 2, 3, 2, 1, 2, 2, 0, 0, 2, 3, 2, 1, 1, 2, 3, 1, 2, 2, 1, 2,
       1, 2, 2, 1, 3, 1, 2, 3, 0, 1, 1, 2, 3, 1, 1, 1, 2, 2, 0, 0, 1, 1,
       3, 2, 2, 3, 0, 0, 2, 0, 1, 2, 2, 0, 2, 3, 1, 2, 0, 3, 1, 3, 1, 2,
       2, 3, 1, 2, 3, 2, 1, 0, 3, 1, 1, 1, 1, 1, 2, 1, 2, 0, 2, 2, 1, 0,
       2, 1, 2, 2, 0, 0, 2, 2, 3, 1, 2, 3, 2, 1, 3, 2, 1, 3, 2, 1, 1, 1,
       2, 2, 1, 0, 3, 2, 0, 0, 1, 3, 1, 2, 1, 2, 3, 2, 3, 3, 3, 2, 1, 1,
       1, 2, 2, 0, 2, 2, 2, 2, 2, 1, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2, 1, 2,
       0, 0, 0, 1, 2, 1, 3, 2, 0, 3, 3, 1, 3, 2, 1, 1, 0, 3, 3, 1, 2, 1,
       0, 2, 1, 2, 0, 3, 1, 0, 0, 0, 1, 0, 2, 0, 2, 2, 2, 2, 1, 1, 1, 2,
       2, 1, 1, 2, 1, 2, 1, 0, 2, 1, 1, 2, 2, 2, 1, 3, 0, 2, 2, 2, 2, 2,
       3, 2, 2, 2, 2, 2, 0, 2, 2, 0, 1, 0, 0, 1, 1, 2, 3, 2, 2, 1, 1, 2,
       1, 2, 0, 2, 2, 1, 0, 0, 1, 0, 3, 1, 2, 0, 1, 2, 3, 1, 1, 1, 3, 1,
       2, 2, 1, 2, 2, 1, 3, 1, 2, 1, 2, 2, 1, 1, 2, 1, 2, 2, 1, 1, 1, 3,
       3, 1, 2, 2, 1, 3, 1, 2, 2, 0, 1, 1, 0, 2, 1,
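
The same binning rule can also be written more compactly with `np.digitize`. The following is a minimal sketch on a few hand-picked prices (the names `prices` and `edges` are illustrative and not part of the notebook). Passing `right=True` makes each interval closed on the right, which matches the $\leq$ comparisons in the table.

```python
import numpy as np

# Upper edges of the first three price ranges from the table
edges = np.array([1.5, 2.5, 3.5])

# Hypothetical prices chosen to sit on and around the boundaries
prices = np.array([0.5, 1.5, 1.51, 2.5, 3.5, 3.51])

# right=True assigns x to bin i when edges[i-1] < x <= edges[i]
labels = np.digitize(prices, edges, right=True)
print(labels)
```

Values above the last edge fall into bin 3, so no separate case is needed for the "very high" range.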

### Step 2: Data Preprocessing

The features X1-X6 are on different scales. A feature with a large numeric range would dominate the label prediction, so we normalize all features to a uniform scale to balance their contributions. Here we use the [MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) to scale each feature into the range $[0,1]$. An alternative for feature normalization is [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html).

In [13]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(X)
X = scaler.transform(X)

In [14]:
X

array([[0.27292576, 0.73059361, 0.00951267, 1.        , 0.61694135,
        0.71932284],
       [0.27292576, 0.44520548, 0.04380939, 0.9       , 0.5849491 ,
        0.71145137],
       [1.        , 0.30365297, 0.08331505, 0.5       , 0.67123122,
        0.75889584],
       ...,
       [0.63646288, 0.42922374, 0.05686115, 0.7       , 0.57149782,
        0.71522536],
       [0.36353712, 0.18493151, 0.0125958 , 0.5       , 0.42014057,
        0.72395946],
       [0.90938865, 0.14840183, 0.0103754 , 0.9       , 0.51211827,
        0.75016174]], shape=(414, 6))
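
For reference, `MinMaxScaler` with its default range rescales each column as $(x - x_{\min}) / (x_{\max} - x_{\min})$. In the output above, the store counts 10, 9 and 5 map to 1.0, 0.9 and 0.5, which is consistent with a column whose minimum is 0 and maximum is 10. The sketch below applies the formula by hand to a small made-up matrix (`X_demo` is illustrative and not taken from the data set).

```python
import numpy as np

# Made-up feature matrix: columns on very different scales
X_demo = np.array([
    [2012.0,   0.0, 10.0],
    [2013.0, 500.0,  5.0],
    [2014.0, 250.0,  0.0],
])

col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)

# Column-wise min-max scaling; assumes no column is constant (col_max > col_min)
X_scaled = (X_demo - col_min) / (col_max - col_min)
print(X_scaled)
```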

### Step 3: Train-Test Split

We randomly split the data X and y_ into training and test sets with a $1:1$ ratio.

In [15]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y_, test_size=0.5, random_state=0)
train_num = y_train.shape[0]
test_num = y_test.shape[0]
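
In the steps above, the scaler was fitted on all 414 samples before splitting, so the test samples contributed to the minimum and maximum used for scaling. A common alternative is to split first and fit the scaler on the training part only. The sketch below shows that ordering on synthetic data (`X_raw` and `labels` are placeholders, not the real estate data). It is one possible arrangement rather than the notebook's procedure, and it would generally give slightly different numbers from those reported here.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X_raw = rng.uniform(0, 1000, size=(20, 3))   # placeholder features
labels = rng.integers(0, 4, size=20)         # placeholder labels in {0, 1, 2, 3}

# Split first ...
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, labels, test_size=0.5, random_state=0)

# ... then learn min/max from the training part only and reuse them on the test part
scaler = MinMaxScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)   # may fall slightly outside [0, 1]
```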

### Step 4: House Price Range Prediction with K Nearest Neighbors

**Task 1**: Implement a K Nearest Neighbors classifier and predict the house price range of the test samples. **Do not** use Scikit-Learn's K Nearest Neighbors implementation. Set the parameter K to 5 and use Euclidean distance as the distance metric.

In [16]:
# Predict the labels of test data by implementing the K Nearest Neighbors algorithm
from collections import Counter

def euclidean_distance(x1, x2):
    """Euclidean (L2) distance between two feature vectors."""
    return np.sqrt(np.sum((x1 - x2) ** 2))

def knn_predict(X_train, y_train, X_test, k=5):
    """Predict each test label by majority vote of its k nearest training samples."""
    predictions = []

    for test_sample in X_test:
        # Distance from this test sample to every training sample, paired with its label
        distances = [(euclidean_distance(test_sample, x), label)
                     for x, label in zip(X_train, y_train)]

        # Keep the labels of the k closest training samples
        k_neighbors = sorted(distances, key=lambda pair: pair[0])[:k]
        k_labels = [label for _, label in k_neighbors]

        # Majority vote; among tied labels, Counter returns the one seen first
        predictions.append(Counter(k_labels).most_common(1)[0][0])

    return np.array(predictions)

y_pred = knn_predict(X_train, y_train, X_test)

In [17]:
# Compare the ground-truth and predicted labels
pd.DataFrame({'y_test': y_test, 'y_pred': y_pred})

,y_test,y_pred
0,2,1
1,0,0
2,2,2
3,0,0
4,2,2
...,...,...
202,1,1
203,1,1
204,1,2
205,2,3


In [18]:
# Compute the accuracy score as an evaluation of the implemented K Nearest Neighbors algorithm
acc = (y_test == y_pred).mean()
acc

np.float64(0.6714975845410628)
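
The implementation above computes one distance at a time. As a sketch of an alternative, the same idea can be expressed with NumPy broadcasting. The function name `knn_predict_vectorized` and the toy arrays are illustrative. This version resolves vote ties toward the smallest label, which is not necessarily how the loop-based version resolves them.

```python
import numpy as np

def knn_predict_vectorized(X_train, y_train, X_test, k=5):
    """KNN majority vote using a full pairwise distance matrix.

    Assumes y_train holds non-negative integer labels.
    """
    # diff[i, j, :] = X_test[i] - X_train[j]
    diff = X_test[:, None, :] - X_train[None, :, :]
    dists = np.sqrt((diff ** 2).sum(axis=2))          # shape (n_test, n_train)

    # Indices of the k nearest training samples for each test sample
    nearest = np.argsort(dists, axis=1, kind="stable")[:, :k]
    neighbor_labels = y_train[nearest]                # shape (n_test, k)

    # Count votes per class; argmax picks the smallest label on ties
    n_classes = int(y_train.max()) + 1
    votes = np.stack([np.bincount(row, minlength=n_classes)
                      for row in neighbor_labels])
    return votes.argmax(axis=1)

# Toy data: two well-separated clusters
X_tr = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                 [1.0, 1.0], [0.9, 1.0], [1.0, 0.9]])
y_tr = np.array([0, 0, 0, 1, 1, 1])
X_te = np.array([[0.05, 0.05], [0.95, 0.95]])

toy_pred = knn_predict_vectorized(X_tr, y_tr, X_te, k=3)
print(toy_pred)
```

The intermediate `diff` array has shape `(n_test, n_train, n_features)`, so this trades memory for speed. That is unproblematic at the size of this data set but worth keeping in mind for larger ones.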

**Task 2**: Read [Scikit-Learn's KNeighborsClassifier API](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html). Use [Scikit-Learn's KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) to predict the house price range of the test samples and compare the result with your implementation. Set the parameter K to 5 and use Euclidean distance as the distance metric.

In [19]:
from sklearn.neighbors import KNeighborsClassifier
# Train the KNeighborsClassifier with the training data (X_train, y_train)
knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn.fit(X_train, y_train)

In [20]:
# Predict the class labels of test data
y_pred = knn.predict(X_test)

In [21]:
# Compare the ground-truth and predicted class labels
pd.DataFrame({'y_test': y_test, 'y_pred': y_pred})

,y_test,y_pred
0,2,1
1,0,0
2,2,2
3,0,0
4,2,2
...,...,...
202,1,1
203,1,1
204,1,2
205,2,2


In [22]:
# Compute the accuracy score as an evaluation of the trained KNeighborsClassifier
acc = knn.score(X_test, y_test)
acc

0.6811594202898551
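
The two accuracies differ by exactly 2/207, and the tables above disagree on row 205 (3 from the hand-written version, 2 from scikit-learn). One plausible cause, stated here as an assumption rather than a verified diagnosis, is tie handling: `Counter.most_common` returns the label encountered first among equally frequent ones, whereas an argmax over per-class vote counts returns the smallest label. The sketch below shows the two rules disagreeing on a hypothetical set of five neighbor labels.

```python
from collections import Counter
import numpy as np

# Hypothetical labels of the 5 nearest neighbors, ordered from nearest to farthest
k_labels = [3, 2, 2, 3, 1]   # labels 2 and 3 both receive two votes

# Rule used in knn_predict: the first-encountered label wins among ties
first_seen_winner = Counter(k_labels).most_common(1)[0][0]

# Alternative rule: argmax over vote counts, so the smallest tied label wins
smallest_label_winner = int(np.bincount(k_labels).argmax())

print(first_seen_winner, smallest_label_winner)
```

Checking whether the disagreeing test samples actually have tied votes among their five neighbors would confirm or rule out this explanation.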

**Further Exploration (Optional)**:
+ Change the [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)'s "weights" parameter to "distance" and evaluate the change in accuracy.
+ Read [Scikit-Learn's RadiusNeighborsClassifier API](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.RadiusNeighborsClassifier.html), and use it to predict the house price range of test house samples.
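
As a starting point for these two exercises, the sketch below fits both variants on synthetic two-cluster data (the arrays and the radius value are placeholders). With the notebook's `X_train`, `y_train`, `X_test` and `y_test` in place of the toy arrays, the same calls apply. The radius has to be tuned to the scaled feature space, and `outlier_label` is set because `RadiusNeighborsClassifier` otherwise raises an error for a test sample with no training neighbor inside the radius.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, RadiusNeighborsClassifier

rng = np.random.default_rng(0)
# Two synthetic clusters standing in for scaled features
X_tr = np.vstack([rng.normal(0.2, 0.05, size=(30, 2)),
                  rng.normal(0.8, 0.05, size=(30, 2))])
y_tr = np.array([0] * 30 + [1] * 30)
X_te = np.array([[0.2, 0.2], [0.8, 0.8]])
y_te = np.array([0, 1])

# Closer neighbors get larger votes (weight = 1 / distance)
knn_dist = KNeighborsClassifier(n_neighbors=5, weights="distance",
                                metric="euclidean")
knn_dist.fit(X_tr, y_tr)

# All training samples within `radius` vote; the value 0.2 is a placeholder
rnc = RadiusNeighborsClassifier(radius=0.2, metric="euclidean",
                                outlier_label="most_frequent")
rnc.fit(X_tr, y_tr)

print(knn_dist.score(X_te, y_te), rnc.score(X_te, y_te))
```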