### Dataset description

This dataset contains house sale prices for King County, which includes Seattle. Data includes homes sold between May 2014 and May 2015.

The dataset contains 21613 instances with 21 attributes.

Attribute descriptions:
* **id**: Unique ID for each home sold
* **date**: Date of the home sale
* **price**: Price of each home sold
* **bedrooms**: Number of bedrooms
* **bathrooms**: Number of bathrooms, where .5 accounts for a room with a toilet but no shower
* **sqft_living**: Square footage of the home's interior living space
* **sqft_lot**: Square footage of the land space
* **floors**: Number of floors
* **waterfront**: A dummy variable for whether the property overlooks the waterfront or not
* **view**: An index from 0 to 4 of how good the view of the property was
* **condition**: An index from 1 to 5 on the condition of the property
* **grade**: An index from 1 to 13, where 1-3 falls short of building construction and design, 7 has an average level of construction and design, and 11-13 have a high quality level of construction and design.
* **sqft_above**: The square footage of the interior housing space that is above ground level
* **sqft_basement**: The square footage of the interior housing space that is below ground level
* **yr_built**: The year the house was initially built
* **yr_renovated**: The year of the house’s last renovation **(attribute will be converted to a boolean value: "Was the house renovated? ")**
* **zipcode**: What zipcode area the house is in
* **lat**: Latitude
* **long**: Longitude
* **sqft_living15**: The square footage of interior housing living space for the nearest 15 neighbors **(attribute will be omitted)**
* **sqft_lot15**: The square footage of the land lots of the nearest 15 neighbors **(attribute will be omitted)**

### Import libraries

In [None]:
import numpy as np 
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from scipy import stats

import sklearn.cluster, sklearn.metrics
import scipy.spatial

#### Load data to DataFrame.


In [None]:
df = pd.read_csv('../input/housesalesprediction/kc_house_data.csv')
df.head()

#### Simple data preprocessing
* set index to *id*
* convert *date* to datetime format
* create a custom boolean attribute *is_renovated* ("Was the house renovated?")
* drop attributes [*sqft_living15* , *sqft_lot15*]

In [None]:
df.set_index('id', inplace = True)
df['date'] = pd.to_datetime(df['date'])
df.loc[df['yr_renovated'] == 0, 'is_renovated'] = False
df.loc[df['yr_renovated'] != 0, 'is_renovated'] = True
df.drop(['sqft_living15', 'sqft_lot15'], axis=1, inplace=True)
df.head()

In [None]:
df.dtypes

In [None]:
df.describe()

Check whether the data contains any NULL values.

In [None]:
df.isna().sum()

## EDA
### Visualization of house price distribution

In [None]:
plt.figure(figsize=(8,5))
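# histplot replaces the deprecated distplot; kde=True and stat="density" keep the same look
sns.histplot(df["price"], bins=200, kde=True, stat="density")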
plt.xlim(0,2000000)
plt.xlabel("Price ($)")
plt.ylabel("Density")
plt.show()

### Correlation matrix 

In [None]:
# numeric_only=True skips non-numeric columns such as date
corr_matrix = df.corr(method='spearman', numeric_only=True)
plt.figure(figsize=(10,10))
sns.heatmap(corr_matrix)
plt.show()

### Visualization of which attributes correlate strongly with price

In [None]:
corr_price = df.corr(method='spearman', numeric_only=True)['price'].sort_values(ascending=False)
sns.barplot(x = corr_price.index, y = corr_price.array)
plt.xticks(rotation=90)
plt.show()

### Visualization of living space in relation to price 
The points are additionally colored by the number of bathrooms, because that attribute is also highly correlated with price.

In [None]:
plt.figure(figsize=(8,5))
plot = sns.scatterplot(data=df, y='price', x='sqft_living', hue='bathrooms')
plt.xlabel("Living Space (sqft)")
plt.ylabel("Price ($)")
plt.show()

House prices rise with living space, and the same trend holds for the number of bathrooms.

### Visualization of the days of the week on which most houses were sold

In [None]:
df['day_of_week'] = df['date'].dt.dayofweek
df.head()

In [None]:
day_of_week_names = ('Mon', 'Tue', 'Wed', 'Thu', 'Fri','Sat','Sun') 

plt.figure(figsize=(8,5))
sns.countplot(x=df['day_of_week'])
plt.xlabel('Day of Week')
plt.ylabel('Sold houses')
plt.xticks(np.arange(7),day_of_week_names)
plt.show()

As expected, far fewer houses were sold over the weekend than on weekdays. Sales peaked on Tuesday and declined over the following days.
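
`Series.dt.dayofweek` encodes Monday as 0 and Sunday as 6, which is why the tick labels above start with 'Mon'. The following is a minimal sketch of that mapping on three made-up sale dates; the `sales` frame and its values are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Hypothetical sale dates, not taken from the dataset
sales = pd.DataFrame({"date": pd.to_datetime(["2014-05-01", "2014-05-03", "2014-05-06"])})

day_of_week_names = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# dayofweek: Monday=0 ... Sunday=6
sales["day_of_week"] = sales["date"].dt.dayofweek
sales["day_name"] = sales["day_of_week"].map(lambda d: day_of_week_names[d])

# Count sales per weekday, keeping the Monday-to-Sunday order
counts = sales["day_name"].value_counts().reindex(day_of_week_names, fill_value=0)
print(sales)
print(counts)
```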

### Visualization of zip code location in relation to price

In [None]:
# Select the price column before aggregating so only it is averaged
zip_price = df.groupby('zipcode')['price'].mean()

plt.figure(figsize=(20,5))
sns.barplot(x = zip_price.index, y = zip_price.array)
plt.xticks(rotation=90)
plt.xlabel('Zip code')
plt.ylabel('Price ($)')
plt.show()

In [None]:
zip_price = df.groupby('zipcode')['price'].mean().sort_values(ascending=False)
zip_price

#### The most expensive and the cheapest locations by mean price per zip code
* the most expensive location = 98039
* the cheapest location = 98002

### Comparison and visualization of the two locations selected above

In [None]:
df_high_price_by_zipcode = df.drop(["date"],axis = 1).loc[df["zipcode"] == 98039].mean()
df_low_price_by_zipcode = df.drop(["date"],axis = 1).loc[df["zipcode"] == 98002].mean()

df_compare = pd.concat([df_high_price_by_zipcode, df_low_price_by_zipcode], axis=1).rename(columns={0: '98039',1: '98002'})
df_compare.drop(["zipcode","lat","long","day_of_week","yr_renovated"], inplace = True)
df_compare

In [None]:
fig = plt.figure(figsize=(15,15))
fig.suptitle('Comparison of two zip code locations')
for i, item in enumerate(list(df_compare.index.values)):
    plt.subplot(4,4,i+1)
    tmp = pd.melt(df_compare.loc[[item]])
    sns.barplot(x = tmp["variable"], y = tmp["value"])
    plt.title(f"Mean: {item}")
    plt.xlabel('Zipcode')
    plt.ylabel('Value')

fig.tight_layout(pad=4.0)
plt.show()

As the graphs show, most attributes have higher mean values for zip code *98039*. The higher mean *waterfront* value also suggests that region *98039* lies on the coast, which may increase the price. Another important reason for the higher price may be the larger living space and land space.

## Clustering

**House prices were categorized into 4 classes:**
* 0 - 321 999 = 1 (5404 instances)
* 322 000 - 450 999 = 2 (5464 instances)
* 451 000 - 644 999 = 3 (5332 instances)
* 645 000 and more = 4 (5413 instances)

In [None]:
df.loc[df['price'] <= 321999, 'price_cat'] = 1
df.loc[df['price'] >= 322000, 'price_cat'] = 2
df.loc[df['price'] >= 451000, 'price_cat'] = 3
df.loc[df['price'] >= 645000, 'price_cat'] = 4
df.head()

In [None]:
df['price_cat'].value_counts()
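
The four threshold assignments above can also be written as a single binning step. The following is a minimal sketch using `pd.cut` on made-up prices that sit on and around the class boundaries; the `prices` series is hypothetical. One difference is an assumption worth noting: a non-integer price between 321 999 and 322 000 would land in class 1 here, whereas the assignments above would leave it unassigned.

```python
import numpy as np
import pandas as pd

# Hypothetical prices chosen to sit on and around the class boundaries
prices = pd.Series([150000, 321999, 322000, 450999, 451000, 644999, 645000, 2000000])

# Left-closed bins: [0, 322000), [322000, 451000), [451000, 645000), [645000, inf)
bins = [0, 322000, 451000, 645000, np.inf]
price_cat = pd.cut(prices, bins=bins, labels=[1, 2, 3, 4], right=False).astype(int)
print(price_cat.tolist())
```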

In [None]:
df_clustering = df.drop(['date','price','yr_renovated','day_of_week','lat','long'],axis=1)
df_clustering.head()

Some unnecessary or redundant attributes were omitted.

In [None]:
normalized_df = (df_clustering-df_clustering.min())/(df_clustering.max()-df_clustering.min())
normalized_df['price_cat'] = df_clustering['price_cat']
normalized_df.head()

The data are min-max normalized to the interval [0, 1]. The target *price_cat* is restored to its original values afterwards.
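
Min-max scaling maps every value x of a column to (x - min) / (max - min), so the smallest value becomes 0 and the largest becomes 1. The following is a minimal sketch on a hypothetical toy frame standing in for `df_clustering`.

```python
import pandas as pd

# Hypothetical toy data standing in for df_clustering
toy = pd.DataFrame({"bedrooms": [2, 3, 5], "sqft_living": [800.0, 1600.0, 2400.0]})

# Column-wise min-max scaling: (x - min) / (max - min)
toy_normalized = (toy - toy.min()) / (toy.max() - toy.min())
print(toy_normalized)
```

Because every attribute ends up on the same scale, no single attribute with large raw values (such as *sqft_lot*) dominates the Euclidean distances used by the clustering algorithms.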

In [None]:
normalized_df_no_target = normalized_df.drop('price_cat',axis = 1)

Create a DataFrame for clustering without the target value.

#### K-Means clustering

In [None]:
clustering_scores = []
for k in range(2, 10):
    clustering = sklearn.cluster.KMeans(n_clusters=k).fit(normalized_df_no_target)
    clustering_scores.append({
        'k': k,
        'sse': clustering.inertia_,
        'silhouette': sklearn.metrics.silhouette_score(normalized_df_no_target, clustering.labels_)
    })
    
df_clustering_scores = pd.DataFrame.from_dict(clustering_scores, orient='columns')
df_clustering_scores = df_clustering_scores.set_index('k')
df_clustering_scores

Run K-means clustering for K from 2 to 9 (`range(2, 10)`). The SSE and the silhouette score are stored for each K in order to select the best one.

In [None]:
sns.lineplot(x = df_clustering_scores.index, y = df_clustering_scores['sse'])
plt.show()

In [None]:
sns.lineplot(x= df_clustering_scores.index, y = df_clustering_scores['silhouette'])
plt.show()

The peak in the silhouette score indicates that the best value of K is 5.
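
The peak can also be read from the score table programmatically. The following is a minimal sketch; the `scores` frame mirrors the layout of `df_clustering_scores`, but its values are made up for illustration only.

```python
import pandas as pd

# Made-up scores for illustration only; real values come from df_clustering_scores
scores = pd.DataFrame(
    {"k": [2, 3, 4, 5, 6], "silhouette": [0.31, 0.35, 0.38, 0.42, 0.36]}
).set_index("k")

# idxmax returns the index label (here: k) of the largest silhouette score
best_k = scores["silhouette"].idxmax()
print(best_k)
```

Because `KMeans` starts from randomly chosen centroids, passing a fixed `random_state` keeps the scores repeatable between runs.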

In [None]:
clustering = sklearn.cluster.KMeans(n_clusters=5).fit(normalized_df_no_target)
pd.Series(clustering.labels_).value_counts()

In [None]:
normalized_df['k_means_clusters'] = pd.Series(index=normalized_df.index, data=clustering.labels_)
normalized_df.head()

Add the cluster number to the DataFrame.

In [None]:
normalized_df['id'] = normalized_df.index
df_tmp_count = normalized_df.groupby(['k_means_clusters', 'price_cat']).id.count().reset_index(name='count')
df_tmp_count

In [None]:
sns.barplot(data=df_tmp_count , x='price_cat', y='count', hue='k_means_clusters')

As the plot shows, cluster membership is related to *price_cat*.

The green cluster represents mostly cheaper houses. On the other hand, the purple, red, and orange clusters mostly represent the more expensive classes. The blue cluster is somewhat specific because it peaks in the second price category. Let's look at a visualization for each attribute in the data frame.
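
Raw counts are hard to compare when the clusters differ in size. The following is a minimal sketch of a row-normalized cross-tabulation that shows the share of each price category within every cluster. The `toy` frame and its labels are hypothetical; in the notebook the two columns would come from `normalized_df`.

```python
import pandas as pd

# Hypothetical labels; in the notebook these would be normalized_df columns
toy = pd.DataFrame({
    "k_means_clusters": [0, 0, 0, 1, 1, 1, 1, 2],
    "price_cat":        [1, 1, 2, 3, 4, 4, 4, 2],
})

# Share of each price category within every cluster (each row sums to 1)
shares = pd.crosstab(toy["k_means_clusters"], toy["price_cat"], normalize="index")
print(shares.round(2))
```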

In [None]:
normalized_df["constant"] = "Data"

A temporary constant attribute was created to serve as the x-axis of the swarm plots.

In [None]:
fig = plt.figure(figsize=(20,40))
number_sample = 600
not_used_columns = ['constant', 'price_cat', 'k_means_clusters', 'id']
values = normalized_df.drop(not_used_columns, axis=1).columns
for i,item in enumerate(values):
    plt.subplot(8,2,i+1)
    sns.swarmplot(x=normalized_df['constant'][:number_sample], y=normalized_df[item][:number_sample], hue=normalized_df['k_means_clusters'][:number_sample])
    plt.title(f"Swarmplot for: {item}")
    plt.legend(loc='upper left')

fig.tight_layout(pad=3.0)
plt.show()

Only the first 600 rows of the data were used for the visualization.
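
The slice `[:number_sample]` keeps the first rows rather than a random subset. If the row order is not random, `DataFrame.sample` may give a more representative picture. The following is a minimal sketch on a hypothetical frame standing in for `normalized_df`.

```python
import pandas as pd

# Hypothetical frame standing in for normalized_df
toy = pd.DataFrame({"grade": range(1000), "k_means_clusters": [i % 5 for i in range(1000)]})

number_sample = 600

# Draw rows at random without replacement; random_state makes the draw repeatable
sample = toy.sample(n=number_sample, random_state=0)
print(len(sample), sample.index.is_unique)
```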

For each attribute, the grouping into clusters is clearly visible.

#### Agglomerative clustering

In [None]:
clustering = sklearn.cluster.AgglomerativeClustering(n_clusters=5)
clustering.fit(normalized_df_no_target)
normalized_df['agglomerative_clusters'] = pd.Series(index=normalized_df.index, data=clustering.labels_)
normalized_df['agglomerative_clusters'].value_counts()

In [None]:
df_tmp_count = normalized_df.groupby(['agglomerative_clusters', 'price_cat']).id.count().reset_index(name='count')
sns.barplot(data=df_tmp_count , x='price_cat', y='count', hue='agglomerative_clusters')
plt.show()

The clustering looks similar to the previous k-means clustering. The purple and green clusters represent mostly cheaper houses. The blue, orange and red clusters predominantly represent more expensive houses. 
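
Cluster numbers (and therefore plot colors) are assigned independently by each algorithm, so the same color does not denote the same group in both plots. One way to quantify how similar two clusterings are, regardless of numbering, is the adjusted Rand index. The following is a minimal sketch with hypothetical labels; in the notebook the inputs would be the `k_means_clusters` and `agglomerative_clusters` columns.

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical cluster labels for six houses
k_means_labels = [0, 0, 1, 1, 2, 2]
agglomerative_labels = [2, 2, 0, 0, 1, 1]  # same grouping, different label numbers
shuffled_labels = [0, 1, 0, 1, 0, 1]       # a different grouping

# The index ignores label numbering: 1.0 means identical partitions,
# values near or below 0 mean agreement no better than chance
same = adjusted_rand_score(k_means_labels, agglomerative_labels)
different = adjusted_rand_score(k_means_labels, shuffled_labels)
print(same, different)
```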

In [None]:
fig = plt.figure(figsize=(20,40))
number_sample = 600
not_used_columns = ['constant', 'price_cat', 'k_means_clusters','agglomerative_clusters', 'id']
values = normalized_df.drop(not_used_columns, axis=1).columns
for i,item in enumerate(values):
    plt.subplot(8,2,i+1)
    sns.swarmplot(x=normalized_df['constant'][:number_sample], y=normalized_df[item][:number_sample], hue=normalized_df['agglomerative_clusters'][:number_sample])
    plt.title(f"Swarmplot for: {item}")
    plt.legend(loc='upper left')

fig.tight_layout(pad=3.0)
plt.show()

Again, the clusters separate the instances well for each attribute.