# King County House Prices
## Neighborhood Classification
<p>
In this notebook, I use another dataset, SEA Building Energy Benchmarking (source below), which gives the GPS coordinates and the neighborhood (North, East, Ballard, Delridge, etc.) of each building.<br>
    I cleaned that dataset as part of a data scientist training project and got the idea of using it to assign a neighborhood to each King County house with a KNN classifier.<br>
    <br>
    The resulting neighborhood feature may help improve the performance of house price prediction models.<br>
    <br>
    <b>Results are at the bottom of the notebook.</b>
</p>

### Imports

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)
sns.set()

In [None]:
data = pd.read_csv("../input/housesalesprediction/kc_house_data.csv")

### Exploratory Functions

In [None]:
def describe_columns(df):
    # Summarize missing values and cardinality for every column
    desc_df = pd.DataFrame(index=df.columns, columns=['NaN count', 'NaN frequency (%)', 'Number of unique values'])
    desc_df['NaN count'] = df.isna().sum()
    desc_df['NaN frequency (%)'] = desc_df['NaN count']/df.shape[0]*100
    # nunique() ignores NaN by default and avoids chained assignment
    desc_df['Number of unique values'] = df.nunique()
    return desc_df

def move_column(df, column_name, column_place):
    mvd_column = df.pop(column_name)
    df.insert(column_place, column_name, mvd_column)
    return df

def prop_nan(df):
    return (df.isna()).sum().sum()/df.size

def nan_map(df, save=False, filename='nan_location'):
    plt.figure(figsize=(20,10))
    sns.heatmap(df.isna())
    if save:
        plt.savefig(filename)
        
def corr_matrix(df, figsize=(30,20), maptype='heatmap', absolute=False, crit_value=None,
                annot=True, save=False, filename='corr_matrix'):
    
    # Correlations are only defined for numeric columns
    matrix_corr = df.select_dtypes(include='number').corr()
    
    if absolute:
        matrix_corr = matrix_corr.abs()
    if crit_value is not None:
        matrix_corr = matrix_corr >= crit_value
    if maptype=='heatmap':
        plt.figure(figsize=figsize)
        sns.heatmap(matrix_corr, annot=annot)
    elif maptype=='clustermap':
        # clustermap creates its own figure, so pass figsize directly
        sns.clustermap(matrix_corr, annot=annot, figsize=figsize)
    
    if save:
        plt.savefig(filename)

In [None]:
df = data.copy()

### Column descriptions

<p>
<b>id</b> - Unique ID for each home sold<br>
<b>date</b> - Date of the home sale<br>
<b>price</b> - Price of each home sold<br>
<b>bedrooms</b> - Number of bedrooms<br>
<b>bathrooms</b> - Number of bathrooms, where .5 accounts for a room with a toilet but no shower<br>
<b>sqft_living</b> - Square footage of the home's interior living space<br>
<b>sqft_lot</b> - Square footage of the land space<br>
<b>floors</b> - Number of floors<br>
<b>waterfront</b> - A dummy variable for whether the home overlooks the waterfront or not<br>
<b>view</b> - An index from 0 to 4 of how good the view of the property was<br>
<b>condition</b> - An index from 1 to 5 on the condition of the home<br>
<b>grade</b> - An index from 1 to 13, where 1-3 falls short of building construction and design, 7 has an average level of construction and design, and 11-13 have a high quality level of construction and design.<br>
<b>sqft_above</b> - The square footage of the interior housing space that is above ground level<br>
<b>sqft_basement</b> - The square footage of the interior housing space that is below ground level<br>
<b>yr_built</b> - The year the house was initially built<br>
<b>yr_renovated</b> - The year of the house’s last renovation<br>
<b>zipcode</b> - What zipcode area the house is in<br>
<b>lat</b> - Latitude<br>
<b>long</b> - Longitude<br>
<b>sqft_living15</b> - The square footage of interior housing living space for the nearest 15 neighbors<br>
<b>sqft_lot15</b> - The square footage of the land lots of the nearest 15 neighbors<br>

Verified against two sources:<br>
https://www.slideshare.net/PawanShivhare1/predicting-king-county-house-prices<br>
https://rstudio-pubs-static.s3.amazonaws.com/155304_cc51f448116744069664b35e7762999f.htm<br>
</p>

In [None]:
df.head()

### Scatter plot of two numerical columns

In [None]:
def plot_2_features(df, x_name, y_name):
    plt.figure(figsize=(12,8))
    plt.scatter(df[x_name], df[y_name], s=2)
    plt.xlabel(x_name)
    plt.ylabel(y_name)

### Plot map with a numerical column

In [None]:
def plot_map_num(df, y_name, interquartile=True, v=None):
    plt.figure(figsize=(20,10))
    # Color scale limits: explicit (vmin, vmax) pair, interquartile range, or automatic
    vmin, vmax = None, None
    if v is not None:
        vmin, vmax = v
    elif interquartile:
        vmin, vmax = df[y_name].quantile([0.25, 0.75])
    points = plt.scatter(df['long'], df['lat'], c=df[y_name], cmap='jet', lw=0, s=2, vmin=vmin, vmax=vmax)
    plt.colorbar(points)
    plt.xlabel('Long')
    plt.ylabel('Lat')

### Plot price map

In [None]:
plot_map_num(df, 'price', interquartile=True)

### Load the dataset containing neighborhoods with GPS coordinates

Source: https://www.kaggle.com/city-of-seattle/sea-building-energy-benchmarking#2015-building-energy-benchmarking.csv

Note: I loaded a cleaned version of the dataset that I made for a data-science online training. 

In [None]:
neighborhood_data = pd.read_csv('../input/sea-energy-building-benchmark/data_cleaned.csv')

Selecting only the relevant columns

In [None]:
neighborhood_df = neighborhood_data.copy()
neighborhood_df = neighborhood_df[['Latitude', 'Longitude', 'Neighborhood']]

In [None]:
neighborhood_df.head()

In [None]:
neighborhood_df['Neighborhood'].unique()

### Importing KNN, MinMaxScaler

In [None]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report

In [None]:
X = neighborhood_df.drop('Neighborhood', axis=1).values
y = neighborhood_df['Neighborhood'].values

Splitting Data

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

I wrote my own simple encoding class because I got some errors with LabelEncoder.

In [None]:
class Encoding:
    
    def __init__(self):
        self.dico = {}
        self.inv_dico = {}
        
    def fit(self, y):
        # Assign an integer code to each class, in order of first appearance
        for i, classe in enumerate(pd.Series(y).unique()):
            self.dico[classe] = i
            self.inv_dico[i] = classe
            
    def transform(self, y):
        # Classes not seen during fit() are mapped to NaN
        return pd.Series(y).map(self.dico).values
    
    def inverse_transform(self, y):
        return pd.Series(y).map(self.inv_dico).values
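
For comparison, scikit-learn's `LabelEncoder` provides the same `fit` / `transform` / `inverse_transform` round trip. The sketch below uses made-up labels rather than the real dataset, and `label_encoder`, `codes`, `decoded` and `unseen_rejected` are illustrative names only. One behavioral difference worth knowing: `LabelEncoder` raises an error on labels it has not seen during fitting, whereas a dict-based mapping like the class above returns NaN for them.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up labels for illustration only (not taken from the real dataset)
labels = np.array(["BALLARD", "EAST", "NORTH", "EAST", "BALLARD"])

label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(labels)        # integer codes, classes sorted alphabetically
decoded = label_encoder.inverse_transform(codes)   # back to the original strings

print(label_encoder.classes_)
print(codes)

# LabelEncoder refuses labels it has not seen during fit
try:
    label_encoder.transform(["DELRIDGE"])
    unseen_rejected = False
except ValueError:
    unseen_rejected = True
print(unseen_rejected)
```

Note that `KNeighborsClassifier` also accepts string labels directly, so the encoding step is optional for this model.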

### Using the neighborhood dataset to train a model that predicts Neighborhood in df

In [None]:
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

encoder = Encoding()
encoder.fit(y_train)
y_train_coded = encoder.transform(y_train)
y_test_coded = encoder.transform(y_test)

KNeighborsClassifier with minimal tuning (it may need more parameters or another algorithm).<br> <b>Can be improved.</b>

In [None]:
model = GridSearchCV(KNeighborsClassifier(), {'n_neighbors':range(1,11)})
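
As a rough illustration of what the grid search does and how it could be extended, the sketch below runs it on synthetic points standing in for scaled coordinates. The data, the names `X_demo`, `y_demo`, `param_grid` and `search`, and the extra `weights` parameter are assumptions for the example, not part of the original workflow.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for scaled (lat, long) points: two well-separated clusters
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.2, 0.05, size=(50, 2)),
                    rng.normal(0.8, 0.05, size=(50, 2))])
y_demo = np.array([0] * 50 + [1] * 50)

# Same n_neighbors range as above, plus 'weights' as one possible extra parameter
param_grid = {"n_neighbors": range(1, 11), "weights": ["uniform", "distance"]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_)   # combination with the best mean cross-validated accuracy
print(search.best_score_)    # that mean accuracy
```

After fitting, `best_params_` and `best_score_` show which setting was selected and how it scored in cross-validation, which helps judge whether a wider grid is worth trying.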

<b>Fitting on the training set</b>

In [None]:
model.fit(X_train_scaled, y_train_coded)

<b>Predicting results on the test set</b>

In [None]:
y_pred = encoder.inverse_transform(model.predict(X_test_scaled))

<b>Score on the test set</b>

In [None]:
model.score(X_test_scaled, y_test_coded)

### Confusion Matrix

In [None]:
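labels = sorted(set(y_test) | set(y_pred))  # fixed class order for readable axes
plt.figure(figsize=(12,8))
sns.heatmap(confusion_matrix(y_test, y_pred, labels=labels), annot=True, fmt='d',
            xticklabels=labels, yticklabels=labels)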

### Classification report

In [None]:
print(classification_report(y_test, y_pred))

<b>Adding a new Neighborhood column for the King County houses</b>

In [None]:
df['Neighborhood'] = encoder.inverse_transform(model.predict(scaler.transform(df[['lat', 'long']].values)))
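
The scale, predict and decode chain above could also be bundled into a single scikit-learn pipeline. The following is a minimal sketch on toy coordinates, assuming string labels are passed directly to the classifier. The `buildings` and `houses` frames and their values are hypothetical, not the real data.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Toy labelled buildings (hypothetical coordinates, not the real data)
buildings = pd.DataFrame({
    "Latitude":  [47.50, 47.51, 47.52, 47.70, 47.71, 47.72],
    "Longitude": [-122.40, -122.39, -122.41, -122.30, -122.29, -122.31],
    "Neighborhood": ["SOUTH", "SOUTH", "SOUTH", "NORTH", "NORTH", "NORTH"],
})

# Toy houses to label; column order (lat, long) must match the training order
houses = pd.DataFrame({"lat": [47.505, 47.715], "long": [-122.40, -122.30]})

# The pipeline applies the scaler fitted on the buildings before every prediction
pipe = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=3))
pipe.fit(buildings[["Latitude", "Longitude"]].values, buildings["Neighborhood"].values)

houses["Neighborhood"] = pipe.predict(houses[["lat", "long"]].values)
print(houses)
```

Keeping the scaler and the classifier in one object makes it harder to forget the scaling step, or to swap the column order, when labelling a new dataset.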

### Plot map with a categorical column

In [None]:
def plot_map_categ(df, categ_column):
    plt.figure(figsize=(20,10))
    for classe in df[categ_column].sort_values().unique():
        df_classe = df[df[categ_column]==classe]
        plt.scatter(df_classe['long'], df_classe['lat'], lw=0, s=10, label=classe)
    plt.legend()
    plt.xlabel('Long')
    plt.ylabel('Lat')

### Neighborhood locations
<b>Note:</b> The neighborhood dataset covers a narrower longitude range than the house data, so the labels in the mountain area may not be very accurate.
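
One possible way to see where this caveat applies is to check which points fall outside the coordinate range the scaler was fitted on. With its default settings, `MinMaxScaler` maps such points outside the [0, 1] interval. The sketch below uses toy coordinates, and `train_coords`, `house_coords` and `demo_scaler` are illustrative names.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy training coordinates (lat, long) covering a narrow longitude band
train_coords = np.array([[47.50, -122.40],
                         [47.60, -122.35],
                         [47.70, -122.30]])

# Toy houses: the second one lies far east of the training band
house_coords = np.array([[47.55, -122.33],
                         [47.60, -121.50]])

demo_scaler = MinMaxScaler().fit(train_coords)
scaled = demo_scaler.transform(house_coords)

# Values outside [0, 1] mean the point is beyond the area the scaler was fitted on
outside = ((scaled < 0) | (scaled > 1)).any(axis=1)
print(scaled)
print(outside)
```

Houses flagged this way still receive the label of their nearest training buildings, so their predicted neighborhood should be read with caution.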

In [None]:
plot_map_categ(df, 'Neighborhood')

### Boxplot function

In [None]:
def boxplot_groupes(df, categ_column, target_column, figsize=(20,10)):
    groupes = []
    for cat in list(df[categ_column].unique()):
        groupes.append(df[df[categ_column]==cat][target_column])

    medianprops = {'color':"black"}
    meanprops = {'marker':'o', 'markeredgecolor':'black',
                    'markerfacecolor':'firebrick'}

    plt.figure(figsize=figsize)
    plt.boxplot(groupes, labels=list(df[categ_column].unique()), showfliers=False, medianprops=medianprops, 
                    vert=False, patch_artist=True, showmeans=True, meanprops=meanprops)
    plt.ylabel(categ_column)
    plt.xlabel(target_column)

<b>Boxplot Neighborhood / price</b>

In [None]:
boxplot_groupes(df, 'Neighborhood', 'price')
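
A numeric summary can complement the boxplot. The sketch below shows one way to compute per-neighborhood statistics with `groupby`. It runs on a toy frame with made-up prices, and `toy` and `summary` are illustrative names. The same call would apply to `df[['Neighborhood', 'price']]`.

```python
import pandas as pd

# Toy prices (made-up numbers) standing in for df[['Neighborhood', 'price']]
toy = pd.DataFrame({
    "Neighborhood": ["NORTH", "NORTH", "EAST", "EAST", "EAST"],
    "price": [300000, 500000, 700000, 900000, 800000],
})

# Count, median and mean price per neighborhood, most expensive first
summary = (toy.groupby("Neighborhood")["price"]
              .agg(["count", "median", "mean"])
              .sort_values("median", ascending=False))
print(summary)
```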

### Updated King County house prices dataset with a 'Neighborhood' column

In [None]:
df.head()

## Conclusion
<b>Prices differ noticeably between neighborhoods. The model that predicts the neighborhood can still be improved.</b>