## House Prices Clustering

## Table of contents
* [1. Data import](#1.-Data-import)
* [2. Missing values](#2.-Missing-values)
* [3. Numerical data](#3.-Numerical-data)
* [4. Categorical features](#4.-Categorical-features)
* [5. Corrplot](#5.-Corrplot)
* [6. Final set of features](#6.-Final-set-of-features)
* [7. Target](#7.-Target)
* [8. Visualization](#8.-Visualization)
* [9. Clustering](#9.-Clustering)
    * [9.1 Silhouette plot](#9.1-Silhouette-plot)
    * [9.2 Important features on cluster plot](#9.2-Important-features-on-cluster-plot)

## 1. Data import

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.cluster.hierarchy import dendrogram, ward
from scipy.cluster.hierarchy import fcluster
from sklearn.metrics import silhouette_samples
from matplotlib import cm

In [None]:
data = pd.read_csv('../input/train.csv')

In [None]:
print(data.shape)
data.head()

## 2. Missing values

Plot the number of missing values per feature:

In [None]:
data_null = data.isna().sum()
plt.figure(figsize=(8,8))
data_null[data_null!=0].plot(kind='barh');

Four features have very few non-missing values, so I drop them for this first analysis. This is only the first step, though: other features also have missing values. Let's consider this problem from different angles.

In [None]:
data.drop(['Id','Alley','PoolQC','Fence','MiscFeature'],axis=1, inplace=True)

What about FireplaceQu? This feature has a lot of missing values. If we look at it together with 'Fireplaces', we can see that a NaN value means the house has no fireplace.

In [None]:
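# Assign the result back: an in-place fillna on a selected column may not update the frame
data['FireplaceQu'] = data['FireplaceQu'].fillna('No fireplace')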
data[['FireplaceQu','Fireplaces']].head(3)
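
One way to check that missing quality values really correspond to houses without a fireplace is to cross-tabulate the two columns. The sketch below uses a small made-up frame (`toy` and its values are hypothetical, not the competition data); on the real data the same calls would be run on `data` before the fill.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real data
toy = pd.DataFrame({
    'Fireplaces':  [0, 1, 0, 2, 1],
    'FireplaceQu': [np.nan, 'Gd', np.nan, 'TA', 'Ex'],
})

# Rows: is FireplaceQu missing? Columns: number of fireplaces
check = pd.crosstab(toy['FireplaceQu'].isna(), toy['Fireplaces'])
print(check)

# True if every missing quality value belongs to a house with 0 fireplaces
missing_rows = toy['FireplaceQu'].isna()
all_missing_have_none = bool((toy.loc[missing_rows, 'Fireplaces'] == 0).all())
print(all_missing_have_none)
```

If the check prints `False` on real data, some NaN values are genuinely unknown and should not be relabelled.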

I fill the missing values in the LotFrontage column with the column median.

In [None]:
data['LotFrontage'] = data['LotFrontage'].fillna(data['LotFrontage'].median())

Now, if we drop every row that still contains a NaN value, we lose only about 8% of the data.

In [None]:
print(round(1-data.dropna().shape[0]/data.shape[0],4))
data.dropna(inplace=True)
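
The share of rows lost to `dropna` can also be read from a row-wise mask, which makes it easy to see which columns cause the loss. This is a minimal sketch on a made-up frame; `toy` and its columns are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few gaps
toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, 4.0, 5.0],
    'b': ['x', 'y', None, 'z', 'w'],
    'c': [10, 20, 30, 40, 50],
})

# Share of missing values in each column
missing_share = toy.isna().mean()

# Share of rows with at least one missing value (these are the rows dropna removes)
rows_lost = toy.isna().any(axis=1).mean()

print(missing_share)
print(rows_lost)
```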

## 3. Numerical data
Consider numerical data:

In [None]:
data.describe(percentiles=[0.1,0.25,0.5,0.75,0.9])

From this summary we can pick out features where a single value dominates the whole sample. Such features are not needed.

In [None]:
# Features where one value dominates the sample
dominant = ['BsmtFinSF2','LowQualFinSF','BsmtHalfBath','KitchenAbvGr',
            'EnclosedPorch','3SsnPorch','ScreenPorch','PoolArea','MiscVal']
fig, axes = plt.subplots(nrows=3, ncols=3, figsize=(15,10))
sns.set(font_scale=2)
# One histogram per feature, filling the 3x3 grid row by row
for ax, col in zip(axes.ravel(), dominant):
    data[col].hist(ax=ax)
    ax.set_title(col)
plt.tight_layout()
data.drop(dominant, axis=1, inplace=True)
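
The same judgement can be made numerically instead of by eye. The sketch below computes, for each column, the share of rows taken by its most frequent value. The helper `dominant_share`, the toy values, and the 0.8 cutoff are all assumptions for illustration, not part of the original analysis.

```python
import pandas as pd

def dominant_share(frame):
    """Share of rows taken by the most frequent value, per column (hypothetical helper)."""
    return frame.apply(
        lambda col: col.value_counts(normalize=True, dropna=False).iloc[0]
    )

# Hypothetical toy frame: one nearly constant column, one varied column
toy = pd.DataFrame({
    'PoolArea':  [0, 0, 0, 0, 0, 0, 0, 0, 0, 512],
    'GrLivArea': [856, 1262, 1786, 1717, 2198, 1362, 1694, 2090, 1774, 1077],
})

shares = dominant_share(toy)
threshold = 0.8  # assumed cutoff, tune to taste
candidates = shares[shares >= threshold].index.tolist()

print(shares)
print(candidates)
```

A cutoff like this only flags candidates; the histograms are still worth a look before dropping anything.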

## 4. Categorical features
Now consider the categorical features using the same method:

In [None]:
data.describe(include='all')

In [None]:
sns.set(font_scale=2)
lst_out = ['Utilities','LandSlope','Condition1','Condition2','BldgType','Street',
           'RoofMatl','ExterCond','BsmtCond','Heating','CentralAir','Electrical',
           'Functional','GarageQual','GarageCond','PavedDrive']
n_row = 4
n_col = 4
fig, axes = plt.subplots(nrows=n_row, ncols=n_col, figsize=(20,20))
# Bar chart of value counts for each candidate feature
for idx, name in enumerate(lst_out):
    ax = axes[idx // n_col, idx % n_col]
    data[name].value_counts().plot(kind='barh', ax=ax)
    ax.set_title(name)
plt.tight_layout()

In [None]:
data_new = data.drop(['Utilities','LandSlope','Condition1','Condition2','BldgType','Street','RoofMatl','ExterCond','BsmtCond',\
'Heating','CentralAir','Electrical','Functional','GarageQual','GarageCond','PavedDrive','MSZoning','SaleType','SaleCondition',\
                     'LandContour','BsmtFinType2'],axis=1)

## 5. Corrplot
Let's look at a classic: the correlation plot. Of course, special attention goes to the target, SalePrice.

In [None]:
plt.figure(figsize=(25,20))
sns.set(font_scale=1)
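# numeric_only=True skips the remaining text columns (required in recent pandas versions)
sns.heatmap(data_new.corr(numeric_only=True),annot=True);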

First, drop the features that are most strongly correlated with other features:

In [None]:
data_new.drop(['TotalBsmtSF','GarageCars','GarageYrBlt','TotRmsAbvGrd'],axis=1,inplace=True)
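
Strongly correlated pairs can also be listed programmatically rather than read off the heatmap. The sketch below is one possible approach on synthetic data; the column names, the random data, and the 0.8 threshold are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)

# Hypothetical toy frame: 'rooms' is nearly a copy of 'area', 'age' is independent
toy = pd.DataFrame({
    'area':  base,
    'rooms': base + rng.normal(scale=0.1, size=200),
    'age':   rng.normal(size=200),
})

corr = toy.corr().abs()

# Keep the strict upper triangle so each pair appears only once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

threshold = 0.8  # assumed cutoff
high_pairs = pairs[pairs > threshold]
print(high_pairs)
```

For each listed pair, one of the two features is a candidate for removal; which one to keep is still a judgement call.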

Visualize each feature's correlation coefficient with the target:

In [None]:
plt.figure(figsize=(10,10))
data_new.corr(numeric_only=True)['SalePrice'].plot(kind='barh');

## 6. Final set of features
Let's split the features into numeric and categorical:

In [None]:
# corr() keeps only numeric columns, so its index lists the numeric features
numer = set(data_new.corr(numeric_only=True)['SalePrice'].index)
categ = list(set(data_new.columns) - numer)
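
An alternative way to make the same split is `select_dtypes`, which keeps the original column order (sets do not). A minimal sketch on a made-up frame follows; `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame with mixed column types
toy = pd.DataFrame({
    'LotArea':    [8450, 9600, 11250],
    'HouseStyle': ['2Story', '1Story', '2Story'],
    'SalePrice':  [208500, 181500, 223500],
})

# Split columns by dtype, preserving column order
numeric_cols = toy.select_dtypes(include='number').columns.tolist()
categorical_cols = toy.select_dtypes(exclude='number').columns.tolist()

print(numeric_cols)
print(categorical_cols)
```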

In [None]:
sns.set(font_scale=2)
n_row = 5
n_col = 5
fig, axes = plt.subplots(nrows=n_row, ncols=n_col, figsize=(20,20))
# Bar chart of value counts for each remaining categorical feature
for idx, name in enumerate(categ):
    ax = axes[idx // n_col, idx % n_col]
    data[name].value_counts().plot(kind='barh', ax=ax)
    ax.set_title(name)
plt.tight_layout()

## 7. Target
Now consider the distribution of the target. We look for abnormal values (outliers), because many ML algorithms are sensitive to them. Plot a histogram and a boxplot together.

In [None]:
fig = plt.figure(figsize=(20,10))
sns.set(font_scale=2)
ax1 = fig.add_subplot(2,3,1)
ax1.set_title('SalePrice histogram')
data_new['SalePrice'].hist(bins=20, ax=ax1);
ax2 = fig.add_subplot(2,3,2)
ax2.set_title('SalePrice boxplot')
sns.boxplot(x=data_new['SalePrice'], ax=ax2)
plt.tight_layout()

Here we can see that houses priced above 350000 make up only about 4% of the data, so I drop them for this first analysis.

In [None]:
print((data_new['SalePrice']<350000).value_counts())
# .copy() so that later column assignments do not act on a slice of the old frame
data_new = data_new[data_new['SalePrice']<350000].copy()
target_val = data_new['SalePrice'].values

Now I cut the target into 5 equal-width bins and look at how each categorical feature is distributed across them. This gives a rough check of how significant each feature is for the target. <br>
We can see that 

In [None]:
cut_value = pd.cut(data_new['SalePrice'],5).values
data_new['SalePrice'] = cut_value
sns.set(font_scale=2)
n_row = 4
n_col = 4
categ_targ = set(categ) - set(['Exterior2nd','Neighborhood','Exterior1st'])
fig, axes = plt.subplots(nrows=n_row, ncols=n_col, figsize=(20,20))

# One count plot per categorical feature, coloured by price bin
for idx, name in enumerate(categ_targ):
    qq = sns.countplot(x=data_new[name], hue=data_new['SalePrice'],
                       ax=axes[idx // n_col, idx % n_col])
    qq.legend_.remove()
plt.legend().set_title('')
plt.tight_layout()
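
To make the binning step concrete: `pd.cut` with an integer splits the value range into bins of equal width, so the bins generally hold different numbers of rows. A small sketch with made-up prices (the values are hypothetical):

```python
import pandas as pd

# Hypothetical prices, not taken from the dataset
prices = pd.Series([50000, 90000, 120000, 150000, 200000, 260000, 340000])

# Five equal-width bins over the observed range
bins = pd.cut(prices, 5)

# Number of rows per bin, in bin order
counts = bins.value_counts().sort_index()
print(counts)
```

If bins with roughly equal counts are preferred, `pd.qcut` is the quantile-based counterpart; the notebook itself uses `pd.cut`.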

In [None]:
data_new.head(2)

Here I convert the categorical features to integer codes:

In [None]:
data_targ = data_new.copy()
for i in categ:
    data_targ[i] = data_targ[i].factorize()[0]
data_targ.head(3)
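
For reference, `factorize` assigns integer codes in order of first appearance, so the codes carry no ordinal meaning (a quality grade is not guaranteed to get a higher code than a worse one). A minimal sketch with made-up labels:

```python
import pandas as pd

# Hypothetical quality labels
s = pd.Series(['Gd', 'TA', 'Gd', 'Ex', 'TA'])

# codes: integer label per row; uniques: the label each code stands for
codes, uniques = s.factorize()

print(codes)
print(list(uniques))
```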

## 8. Visualization
We can see that the scale of the data varies greatly, so I normalize it. Note that `Normalizer` rescales each row (sample) to unit norm rather than rescaling each feature:

In [None]:
from sklearn.preprocessing import Normalizer
data_targ.drop(['SalePrice'],axis=1,inplace=True)
# Normalizer scales every row to unit L2 norm
nrm = Normalizer()
nrm.fit(data_targ)
normal_data = nrm.transform(data_targ)
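
It may help to see what `Normalizer` does on a tiny array. It divides each row by its own L2 norm, so two rows that differ only by a constant factor become identical. A per-feature scaler such as `MinMaxScaler` is shown only for contrast; it is an alternative, not what this notebook uses. The array `X` is made up.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, MinMaxScaler

# Hypothetical data: the second row is 10x the first
X = np.array([[3.0, 4.0],
              [30.0, 40.0],
              [0.0, 5.0]])

row_scaled = Normalizer().fit_transform(X)    # each row divided by its L2 norm
col_scaled = MinMaxScaler().fit_transform(X)  # each column mapped to [0, 1]

print(row_scaled)
print(col_scaled)
```

Because the result depends on direction only, row normalization lets the largest-valued features dominate; that is worth keeping in mind when reading the t-SNE plot.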

Apply t-SNE, a visualization method, to project the data to two dimensions before clustering.

In [None]:
from sklearn.manifold import TSNE
tsn = TSNE(random_state=20)
res_tsne = tsn.fit_transform(normal_data)
plt.figure(figsize=(8,8))
sns.scatterplot(x=res_tsne[:,0],y=res_tsne[:,1]);
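
As a sanity check of the projection step, t-SNE can be run on a small synthetic set first. The sketch below assumes two made-up Gaussian groups in 5 dimensions; the `perplexity=10` value is an assumption chosen because perplexity must be smaller than the number of samples.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(20)

# Two synthetic groups of 30 points each in 5 dimensions (hypothetical data)
X = np.vstack([rng.normal(0, 1, size=(30, 5)),
               rng.normal(8, 1, size=(30, 5))])

# Project to 2D; random_state fixes the otherwise random initialisation
embedding = TSNE(n_components=2, perplexity=10, random_state=20).fit_transform(X)
print(embedding.shape)
```

Distances between far-apart groups in a t-SNE plot are not directly meaningful, so the embedding is best read as a picture of local neighbourhoods.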

## 9. Clustering
We got an interesting result! This plot does not look like noise. Let's try to cluster the data:

In [None]:
link = ward(res_tsne)
vb = fcluster(link,t=300, criterion='distance')
fig = plt.figure(figsize=(25,25))
ax1 = fig.add_subplot(3,3,1)
pd.Series(vb).value_counts().plot(kind='barh')
ax2 = fig.add_subplot(3,3,2)
axpl_2 = sns.scatterplot(x=res_tsne[:,0],y=res_tsne[:,1],hue=vb,palette="Set1");
axpl_2.legend_.remove()
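
The two-step pattern used here (build a Ward linkage, then cut it with `fcluster`) can be tried on synthetic points. The sketch below assumes three made-up blobs; it also shows `criterion='maxclust'`, which cuts by a target number of clusters instead of a distance threshold. The threshold `t=20` fits this toy data only.

```python
import numpy as np
from scipy.cluster.hierarchy import ward, fcluster

rng = np.random.default_rng(0)

# Three well-separated synthetic blobs of 20 points each (hypothetical data)
points = np.vstack([rng.normal((0, 0), 0.5, size=(20, 2)),
                    rng.normal((10, 0), 0.5, size=(20, 2)),
                    rng.normal((0, 10), 0.5, size=(20, 2))])

link = ward(points)  # hierarchical clustering with Ward's criterion

# Cut the tree at a fixed merge distance
by_distance = fcluster(link, t=20, criterion='distance')

# Or ask for at most 3 clusters directly
by_count = fcluster(link, t=3, criterion='maxclust')

print(np.unique(by_distance), np.unique(by_count))
```

Note that `fcluster` labels start at 1, not 0.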

Let's check the quality of clustering with dendrogram and silhouette plot:

In [None]:
sns.set(style='white')
plt.figure(figsize=(10,7))
dendrogram(link)
ax = plt.gca()
bounds = ax.get_xbound()
# Dashed line at the distance threshold used for fcluster
ax.plot(bounds, [300,300],'--', c='k')
plt.show()

### 9.1 Silhouette plot

In [None]:
assign = vb
cluster_labels = np.unique(assign)
n_clusters = len(cluster_labels)
silhouette_vals = silhouette_samples(res_tsne, assign, metric='euclidean')
y_ax_lower, y_ax_upper = 0, 0
yticks = []
plt.figure(figsize=(10,8))
for i, c in enumerate(cluster_labels):
    # Sorted silhouette values of the points in cluster c
    c_silhouette_vals = silhouette_vals[assign==c]
    c_silhouette_vals.sort()
    y_ax_upper += len(c_silhouette_vals)
    color = cm.jet(float(i) / n_clusters)
    plt.barh(range(y_ax_lower,y_ax_upper),
             c_silhouette_vals,height=1.0,edgecolor='none',color=color)
    yticks.append((y_ax_lower+y_ax_upper) / 2)
    y_ax_lower += len(c_silhouette_vals)
silhouette_avg = np.mean(silhouette_vals)

# Red dashed line marks the average silhouette coefficient
plt.axvline(silhouette_avg,color="red",linestyle= "--")
plt.yticks(yticks, cluster_labels)  # fcluster labels already start at 1
plt.ylabel('Cluster')
plt.xlabel('Silhouette coefficient')
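
One way to compare candidate cluster counts is to compute the average silhouette coefficient for each count and pick the largest. The sketch below does this on four made-up blobs; the data and the range of counts are assumptions, and on the notebook's data the input would be the t-SNE coordinates.

```python
import numpy as np
from scipy.cluster.hierarchy import ward, fcluster
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)

# Four well-separated synthetic blobs of 25 points each (hypothetical data)
points = np.vstack([rng.normal((0, 0), 0.5, size=(25, 2)),
                    rng.normal((6, 0), 0.5, size=(25, 2)),
                    rng.normal((0, 6), 0.5, size=(25, 2)),
                    rng.normal((6, 6), 0.5, size=(25, 2))])

link = ward(points)

# Average silhouette coefficient for each candidate number of clusters
scores = {}
for k in range(2, 8):
    labels = fcluster(link, t=k, criterion='maxclust')
    scores[k] = silhouette_score(points, labels, metric='euclidean')

best_k = max(scores, key=scores.get)
print(scores)
print(best_k)
```

The average is only a summary; the per-cluster silhouette plot still shows whether individual clusters are poorly separated.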

### 9.2 Important features on cluster plot
The choice of 6 clusters seems optimal. <br>
Now consider how the target is distributed across the clusters.

In [None]:
plt.figure(figsize=(10,10))
sns.set(font_scale=1.5)
sns.scatterplot(x=res_tsne[:,0], y=res_tsne[:,1],
                hue=data_new['SalePrice'], s=70, palette="hot");

Then I choose the most important features (by correlation coefficient with the target): <br>
OverallQual, GrLivArea, 1stFlrSF, FullBath.

In [None]:
most_sign = ['OverallQual','GrLivArea','1stFlrSF','FullBath']
n_row = 2
n_col = 2
fig, axes = plt.subplots(nrows=n_row, ncols=n_col, figsize=(15,15))
sns.set(font_scale=1)
# One t-SNE scatter plot per feature, coloured by that feature's value
for idx, name in enumerate(most_sign):
    ax = axes[idx // n_col, idx % n_col]
    qq = sns.scatterplot(x=res_tsne[:,0], y=res_tsne[:,1], ax=ax,
                         hue=data_new[name], s=70, palette="RdBu")
    ax.set_title(name)
plt.legend().set_title('')
plt.tight_layout()

Thank you for reading! I hope this kernel was useful for you. <br>
My other kernels: https://www.kaggle.com/nikitagrec/kernels