[back](./03-datamining-fundamentals.ipynb)

---
## `Cluster Analysis`

- This is **unsupervised learning**: the data has no labels.
- We have a set of features, but we do not know in advance how the rows should be grouped.

### `Initial Setup`

In [1]:
# Importing required libraries

import pandas as pd
import numpy as np
import seaborn as sns  # Library used to print nicer charts and visualizations
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans # Used to perform cluster analysis

%matplotlib inline


In [2]:
df = pd.read_csv(r'../../assets/single_family_home_values.csv')
df.head(4)


Unnamed: 0,id,address,city,state,zipcode,latitude,longitude,bedrooms,bathrooms,rooms,squareFootage,lotSize,yearBuilt,lastSaleDate,lastSaleAmount,priorSaleDate,priorSaleAmount,estimated_value
0,39525749,8171 E 84th Ave,Denver,CO,80022,39.84916,-104.893468,3,2.0,6,1378,9968,2003.0,2009-12-17,75000,2004-05-13,165700.0,239753
1,184578398,10556 Wheeling St,Denver,CO,80022,39.88802,-104.83093,2,2.0,6,1653,6970,2004.0,2004-09-23,216935,,,343963
2,184430015,3190 Wadsworth Blvd,Denver,CO,80033,39.76171,-105.08107,3,1.0,0,1882,23875,1917.0,2008-04-03,330000,,,488840
3,155129946,3040 Wadsworth Blvd,Denver,CO,80033,39.76078,-105.08106,4,3.0,0,2400,11500,1956.0,2008-12-02,185000,2008-06-27,0.0,494073


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15000 entries, 0 to 14999
Data columns (total 18 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   id               15000 non-null  int64  
 1   address          15000 non-null  object 
 2   city             15000 non-null  object 
 3   state            15000 non-null  object 
 4   zipcode          15000 non-null  int64  
 5   latitude         14985 non-null  float64
 6   longitude        14985 non-null  float64
 7   bedrooms         15000 non-null  int64  
 8   bathrooms        15000 non-null  float64
 9   rooms            15000 non-null  int64  
 10  squareFootage    15000 non-null  int64  
 11  lotSize          15000 non-null  int64  
 12  yearBuilt        14999 non-null  float64
 13  lastSaleDate     15000 non-null  object 
 14  lastSaleAmount   15000 non-null  int64  
 15  priorSaleDate    11173 non-null  object 
 16  priorSaleAmount  11287 non-null  float64
 17  estimated_va

In [4]:
df.describe()

Unnamed: 0,id,zipcode,latitude,longitude,bedrooms,bathrooms,rooms,squareFootage,lotSize,yearBuilt,lastSaleAmount,priorSaleAmount,estimated_value
count,15000.0,15000.0,14985.0,14985.0,15000.0,15000.0,15000.0,15000.0,15000.0,14999.0,15000.0,11287.0,15000.0
mean,51762290.0,80204.919467,39.740538,-104.964076,2.7084,2.195067,6.164133,1514.5044,5820.7662,1929.517168,405356.3,259435.0,637162.5
std,61908760.0,9.715263,0.023555,0.039788,0.897231,1.166279,1.958601,830.635999,3013.27947,29.937051,775699.8,337938.7,504418.5
min,143367.0,80022.0,39.614531,-105.10844,0.0,0.0,0.0,350.0,278.0,1874.0,259.0,0.0,147767.0
25%,10048020.0,80205.0,39.727634,-104.978737,2.0,1.0,5.0,986.0,4620.0,1907.0,194000.0,110000.0,398434.8
50%,25632410.0,80206.0,39.748048,-104.957689,3.0,2.0,6.0,1267.5,5950.0,1925.0,320000.0,210000.0,518357.5
75%,51142220.0,80207.0,39.758214,-104.937522,3.0,3.0,7.0,1766.25,6270.0,1949.0,463200.0,330240.0,687969.2
max,320948100.0,80209.0,39.88802,-104.83093,15.0,12.0,39.0,10907.0,122839.0,2016.0,45600000.0,16000000.0,10145310.0


### `Preparing Feature Dataset`

Suppose we wanted to predict **estimated_value**. Because the dataset already contains that outcome, this would be a supervised problem.

So we set `estimated_value` aside and keep the other columns as the feature set `X`.

In [5]:
# X will be all the columns, without estimated_value
X = df.drop('estimated_value', axis=1)

We can now run a very simple cluster analysis on a subset of the numeric features

In [6]:
X = X[['bedrooms', 'bathrooms', 'rooms', 'squareFootage', 'lotSize', 'yearBuilt', 'priorSaleAmount']]
X.head()

Unnamed: 0,bedrooms,bathrooms,rooms,squareFootage,lotSize,yearBuilt,priorSaleAmount
0,3,2.0,6,1378,9968,2003.0,165700.0
1,2,2.0,6,1653,6970,2004.0,
2,3,1.0,0,1882,23875,1917.0,
3,4,3.0,0,2400,11500,1956.0,0.0
4,3,4.0,8,2305,5600,1998.0,0.0


In [7]:
# Replace missing values with 0 (reassign instead of modifying a column subset in place)
X = X.fillna(0)
X.head()

Unnamed: 0,bedrooms,bathrooms,rooms,squareFootage,lotSize,yearBuilt,priorSaleAmount
0,3,2.0,6,1378,9968,2003.0,165700.0
1,2,2.0,6,1653,6970,2004.0,0.0
2,3,1.0,0,1882,23875,1917.0,0.0
3,4,3.0,0,2400,11500,1956.0,0.0
4,3,4.0,8,2305,5600,1998.0,0.0
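
Filling every gap with 0 is simple, but a 0 in `yearBuilt` is then treated as a real year when distances are computed. A minimal sketch of one possible alternative, filling each column with its median instead. The `toy` frame below is hypothetical and only borrows two column names from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in two columns (not the housing file)
toy = pd.DataFrame({
    "yearBuilt": [2003.0, np.nan, 1917.0, 1956.0],
    "priorSaleAmount": [165700.0, np.nan, np.nan, 0.0],
})

# Option used in the notebook: replace every missing value with 0
filled_zero = toy.fillna(0)

# Alternative: replace missing values with each column's median
filled_median = toy.fillna(toy.median())

print(filled_zero)
print(filled_median)
```

Which option is appropriate depends on what a missing value means; for `priorSaleAmount`, a gap may simply indicate that no prior sale was recorded.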


### `Cluster Analysis using ` **`KMeans`**

We have now prepared our `X`, the feature set, so we can proceed to create some clusters using `KMeans`

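> **Parameters of `KMeans()` from the original documentation** (quoted from an older scikit-learn release; `precompute_distances` and `n_jobs` have since been removed, so check the documentation for your installed version):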
>
> **`n_clusters`** : int, default=8
>    The number of clusters to form as well as the number of centroids to generate.
>
> **`init`** : {'k-means++', 'random'}, callable or array-like of shape (n_clusters, n_features), default='k-means++'
>    Method for initialization:
> - 'k-means++' : selects initial cluster centers for k-mean clustering in a smart way to speed up convergence. See section Notes in k_init for more details.
>
> - 'random': choose `n_clusters` observations (rows) at random from data for the initial centroids.
>   If an array is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.
>   If a callable is passed, it should take arguments X, n_clusters and a random state and return an initialization.
>
> **`n_init`** : int, default=10
>    Number of time the k-means algorithm will be run with different centroid seeds. The final results will be the best output of n_init consecutive runs in terms of inertia.
>
> **`max_iter`** : int, default=300
>    Maximum number of iterations of the k-means algorithm for a single run.
>
> **`tol`** : float, default=1e-4
>    Relative tolerance with regards to Frobenius norm of the difference in the cluster centers of two consecutive iterations to declare convergence.
>
> **`precompute_distances`** : {'auto', True, False}, default='auto'
>    Precompute distances (faster but takes more memory).
> - 'auto' : do not precompute distances if n_samples * n_clusters > 12 million. This corresponds to about 100MB overhead per job using double precision.
> - True : always precompute distances.
> - False : never precompute distances.
>
> **`verbose`** : int, default=0
>    Verbosity mode.
>
> **`random_state`** : int, RandomState instance or None, default=None
>    Determines random number generation for centroid initialization. Use an int to make the randomness deterministic. See Glossary <random_state>.
>
> **`copy_x`** : bool, default=True
>    When pre-computing distances it is more numerically accurate to center the data first. If copy_x is True (default), then the original data is not modified. If False, the original data is modified, and put back before the function returns, but small numerical differences may be introduced by subtracting and then adding the data mean. Note that if the original data is not C-contiguous, a copy will be made even if copy_x is False. If the original data is sparse, but not in CSR format, a copy will be made even if copy_x is False.
>
> **`n_jobs`** : int, default=None
>    The number of OpenMP threads to use for the computation. Parallelism is sample-wise on the main cython loop which assigns each sample to its closest center.
>    `None` or `-1` means using all processors.
>
> **`algorithm`** : {"auto", "full", "elkan"}, default="auto"
>    K-means algorithm to use. The classical EM-style algorithm is "full". The "elkan" variation is more efficient on data with well-defined clusters, by using the triangle inequality. However it's more memory intensive due to the allocation of an extra array of shape (n_samples, n_clusters).
>    For now "auto" (kept for backward compatibility) chooses "elkan" but it might change in the future for a better heuristic.

In [8]:
kmeans = KMeans(n_clusters=5, random_state=0).fit(X)
kmeans

KMeans(n_clusters=5, random_state=0)
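
Several of the parameters quoted above can also be passed explicitly. The following is a small sketch on hypothetical toy data (the `points` array is made up for illustration), showing how `n_clusters`, `init`, `n_init`, `max_iter`, `tol`, and `random_state` are set and which fitted attributes report the result.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two tight groups of 2-D points
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])

model = KMeans(
    n_clusters=2,       # number of clusters / centroids
    init="k-means++",   # initialization method
    n_init=10,          # restarts with different centroid seeds
    max_iter=300,       # iteration cap for a single run
    tol=1e-4,           # convergence tolerance
    random_state=0,     # fixed seed for repeatable results
).fit(points)

print(model.cluster_centers_.shape)  # (n_clusters, n_features)
print(model.inertia_)                # sum of squared distances to the closest center
print(model.n_iter_)                 # iterations used by the best run
```

Setting `n_init` explicitly keeps the behavior the same across releases whose default differs.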

Understanding the properties of a `KMeans` instance

In [9]:
# labels_
kmeans.labels_

array([1, 1, 1, ..., 0, 4, 0], dtype=int32)

In [10]:
# One label per row, so the length should match the number of rows in X (7 is the number of columns)
len(kmeans.labels_), X.shape

(15000, (15000, 7))
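
Since `labels_` holds one cluster id per row, counting the ids gives the size of each cluster. A minimal sketch, assuming hypothetical toy data with a 30/70 split rather than the housing data:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: 30 points near (0, 0) and 70 points near (4, 4)
rng = np.random.default_rng(1)
points = np.vstack([
    rng.normal(0.0, 0.3, size=(30, 2)),
    rng.normal(4.0, 0.3, size=(70, 2)),
])
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# Count how many rows were assigned to each cluster id
cluster_ids, sizes = np.unique(model.labels_, return_counts=True)
for cid, n in zip(cluster_ids, sizes):
    print(f"cluster {cid}: {n} rows")
```

Very small clusters are worth a closer look, since they often consist of a few extreme rows.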

`kmeans.cluster_centers_`: the coordinates of each cluster center, with one value per feature

In [11]:
kmeans.cluster_centers_


array([[3.31226296e+00, 3.83944374e+00, 8.42730721e+00, 2.69720607e+03,
        6.97174968e+03, 1.94200506e+03, 7.43586930e+05],
       [2.64078392e+00, 1.93525180e+00, 5.86293724e+00, 1.39300918e+03,
        5.94409712e+03, 1.93060060e+03, 3.93563157e+04],
       [3.00000000e+00, 4.50000000e+00, 9.00000000e+00, 3.74800000e+03,
        8.59750000e+03, 1.99800000e+03, 1.37500550e+07],
       [3.73118280e+00, 5.64516129e+00, 1.04408602e+01, 4.51996774e+03,
        1.30122688e+04, 1.96766667e+03, 2.37729552e+06],
       [2.70373430e+00, 2.27247191e+00, 6.20290813e+00, 1.47484848e+03,
        5.39461203e+03, 1.92551404e+03, 2.93157062e+05]])
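
The centers above differ mostly in the last column (`priorSaleAmount`), whose values are orders of magnitude larger than the others. KMeans uses Euclidean distance, so large-valued columns can dominate the result. A hedged sketch of one common remedy, standardizing the features first with `StandardScaler`; the `features` array is hypothetical toy data, and this is not what the notebook itself does.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features: [bedrooms, priorSaleAmount]
features = np.array([
    [2, 100_000.0],
    [2, 200_000.0],
    [5, 101_000.0],
    [5, 201_000.0],
])

# Raw Euclidean distance between rows 0 and 2 is driven almost entirely by the amount
raw_distance = np.linalg.norm(features[0] - features[2])

# After standardizing, every column has mean 0 and standard deviation 1
scaled = StandardScaler().fit_transform(features)
scaled_distance = np.linalg.norm(scaled[0] - scaled[2])

# Scaling and clustering chained in one estimator
pipeline = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)
labels = pipeline.fit_predict(features)

print(raw_distance, scaled_distance)
print(labels)
```

Whether to scale is a modeling choice: it gives every feature a comparable influence, which may or may not be what the analysis needs.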

In [12]:
# Shape of the centers array: (number of clusters, number of features)
kmeans.cluster_centers_.shape

(5, 7)
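
Because the columns of `cluster_centers_` follow the order of the feature columns, the raw array can be wrapped in a DataFrame to make it easier to read. A small sketch on a hypothetical toy frame that reuses two of the column names:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical toy feature frame (not the housing data)
toy = pd.DataFrame({
    "bedrooms": [2, 2, 3, 5, 5, 6],
    "squareFootage": [900, 950, 1000, 3000, 3100, 3200],
})
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy)

# One row per cluster, one named column per feature
centers = pd.DataFrame(model.cluster_centers_, columns=toy.columns)
centers.index.name = "cluster"
print(centers)
```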

Now we can take the cluster labels and append them to our feature dataset `X` as a new column

In [13]:
labels = kmeans.labels_
X['cluster'] = labels
X.head()

Unnamed: 0,bedrooms,bathrooms,rooms,squareFootage,lotSize,yearBuilt,priorSaleAmount,cluster
0,3,2.0,6,1378,9968,2003.0,165700.0,1
1,2,2.0,6,1653,6970,2004.0,0.0,1
2,3,1.0,0,1882,23875,1917.0,0.0,1
3,4,3.0,0,2400,11500,1956.0,0.0,1
4,3,4.0,8,2305,5600,1998.0,0.0,1


With the labels attached, we can summarize each cluster by grouping by `cluster` and computing the `mean`, `median`, `max`, `min`, etc. In the output below, the group means closely match the cluster centers shown earlier.

In [14]:
X.groupby('cluster').mean()

cluster,bedrooms,bathrooms,rooms,squareFootage,lotSize,yearBuilt,priorSaleAmount
0,3.312263,3.839444,8.427307,2697.206068,6971.749684,1942.005057,743586.9
1,2.641226,1.935344,5.863614,1393.248821,5944.416232,1930.599901,39293.21
2,3.0,4.5,9.0,3748.0,8597.5,1998.0,13750060.0
3,3.731183,5.645161,10.44086,4519.967742,13012.268817,1967.666667,2377296.0
4,2.703104,2.272127,6.201783,1474.475561,5394.550363,1925.518329,293073.4


In [15]:
X.groupby('cluster').median()


cluster,bedrooms,bathrooms,rooms,squareFootage,lotSize,yearBuilt,priorSaleAmount
0,3.0,4.0,8.0,2582.0,6250.0,1927.0,651500.0
1,2.0,2.0,5.0,1133.0,6236.5,1928.5,0.0
2,3.0,4.5,9.0,3748.0,8597.5,1998.0,13750055.0
3,4.0,6.0,10.0,4424.0,8580.0,1989.0,2200000.0
4,3.0,2.0,6.0,1327.0,5210.0,1923.0,279900.0


In [16]:
X.groupby('cluster').max()

cluster,bedrooms,bathrooms,rooms,squareFootage,lotSize,yearBuilt,priorSaleAmount
0,13,12.0,39,8456,30200,2016.0,1550000.0
1,9,9.0,22,10907,122839,2016.0,166200.0
2,3,5.0,10,4141,13279,2002.0,16000000.0
3,15,9.0,20,9394,97125,2016.0,5000000.0
4,10,11.0,21,7004,23700,2016.0,518000.0


In [17]:
X.groupby('cluster').min()


cluster,bedrooms,bathrooms,rooms,squareFootage,lotSize,yearBuilt,priorSaleAmount
0,0,0.0,0,662,1626,1879.0,519000.0
1,0,0.0,0,350,278,0.0,0.0
2,3,4.0,8,3355,3916,1994.0,11500110.0
3,1,1.0,4,772,4078,1887.0,1580000.0
4,1,0.0,0,517,1175,1874.0,166331.0
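
The four separate `groupby` calls above can also be combined into one table with `agg`. A minimal sketch on a hypothetical toy frame that already carries a `cluster` column:

```python
import pandas as pd

# Hypothetical toy frame with cluster labels already attached
toy = pd.DataFrame({
    "squareFootage": [900, 1000, 3000, 3200],
    "priorSaleAmount": [0.0, 50_000.0, 700_000.0, 900_000.0],
    "cluster": [0, 0, 1, 1],
})

# One row per cluster; columns are (feature, statistic) pairs
summary = toy.groupby("cluster").agg(["mean", "median", "min", "max"])
print(summary)
```

The result has a two-level column index, so a single statistic can be read with, for example, `summary.loc[0, ("squareFootage", "mean")]`.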


Next, we'll fit `KMeans` for several cluster counts in a loop and compare the fits using a scoring function from `sklearn`

In [18]:
from sklearn.metrics import silhouette_score

In [19]:
# drop() returns a new DataFrame, so reassign to actually remove the label column
X = X.drop('cluster', axis=1)
X

Unnamed: 0,bedrooms,bathrooms,rooms,squareFootage,lotSize,yearBuilt,priorSaleAmount
0,3,2.0,6,1378,9968,2003.0,165700.0
1,2,2.0,6,1653,6970,2004.0,0.0
2,3,1.0,0,1882,23875,1917.0,0.0
3,4,3.0,0,2400,11500,1956.0,0.0
4,3,4.0,8,2305,5600,1998.0,0.0
...,...,...,...,...,...,...,...
14995,4,4.0,8,2169,4950,1922.0,0.0
14996,3,3.0,11,2937,4500,1890.0,557500.0
14997,3,5.0,7,2937,4680,2007.0,1208214.0
14998,3,4.0,10,3193,4970,2005.0,405000.0


Finding the best number of clusters using the `Silhouette Score` (it ranges from -1 to 1; higher is better)

In [20]:
# Fit KMeans for 3 to 9 clusters and print the silhouette score of each fit
for i in range(3, 10):
    kmeans = KMeans(n_clusters=i).fit(X)
    labels = kmeans.labels_
    print(silhouette_score(X, labels=labels))


0.6145595723146733
0.6166029879402746
0.6329117382160774
0.6434385964750505
0.6592385754572507
0.6604722058741889
0.6569891107864054
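
In the run above, the score peaks at the sixth value printed, which corresponds to `n_clusters=8`. Reading the best value off a list of bare numbers is error-prone, so the sketch below keeps each score next to its cluster count and picks the maximum. It uses hypothetical toy data from `make_blobs` and a fixed `random_state`, both of which are assumptions for illustration rather than part of the notebook.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical toy data: three well-separated 2-D blobs
points, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

# Map each candidate cluster count to its silhouette score
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("best k:", best_k)
```

On larger datasets, `silhouette_score` also accepts a `sample_size` argument to score a random subset of rows, which can shorten the loop considerably.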



---
[next](./03B-classification-and-regression.ipynb)