## Clustering
Put simply, the task of clustering is to place observations that seem similar within the same cluster. Clustering is commonly used on two-dimensional data, where the goal is to form clusters based on coordinates. We will do something similar here: we will cluster houses by their latitude-longitude locations using several different clustering methods.

NOTE: The map cells are commented out to keep the file from getting too large.

In [11]:
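# install the packages we will need (uncomment on first run)
import Pkg
# Pkg.add("Clustering")
# Pkg.add("VegaLite")
# Pkg.add("VegaDatasets")
# Pkg.add("DataFrames")
# Pkg.add("JSON")
# Pkg.add("CSV")
# Pkg.add("Distances")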

In [3]:
# Packages we will use throughout this notebook
using Clustering
using VegaLite
using VegaDatasets
using DataFrames
using Statistics
using JSON
using CSV
using Distances

We will start by getting some data: a dataset of more than 20,000 California houses. We will then explore whether housing prices correlate directly with map location.

In [5]:
#download the data
download("https://raw.githubusercontent.com/ageron/handson-ml/master/datasets/housing/housing.csv","newhouses.csv")
houses = CSV.read("newhouses.csv", DataFrame)

Row,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
,Float64,Float64,Float64,Float64,Float64?,Float64,Float64,Float64,Float64,String15
1,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
2,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
3,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
4,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
5,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY
6,-122.25,37.85,52.0,919.0,213.0,413.0,193.0,4.0368,269700.0,NEAR BAY
7,-122.25,37.84,52.0,2535.0,489.0,1094.0,514.0,3.6591,299200.0,NEAR BAY
8,-122.25,37.84,52.0,3104.0,687.0,1157.0,647.0,3.12,241400.0,NEAR BAY
9,-122.26,37.84,42.0,2555.0,665.0,1206.0,595.0,2.0804,226700.0,NEAR BAY
10,-122.25,37.84,52.0,3549.0,707.0,1551.0,714.0,3.6912,261100.0,NEAR BAY


In [7]:
#pull the names
names(houses) # we will use longitude and latitude for clustering

10-element Vector{String}:
 "longitude"
 "latitude"
 "housing_median_age"
 "total_rooms"
 "total_bedrooms"
 "population"
 "households"
 "median_income"
 "median_house_value"
 "ocean_proximity"

We will use the `VegaLite` package for plotting. This package makes it very easy to plot information on a map; all you need is a JSON file of the map you intend to draw. Here, we will use the California counties JSON file, plot each house on the map, and color each point by its price on a continuous color scale. The coloring is done by the line `color="median_house_value:q"`, where `:q` marks the field as quantitative.

In [29]:
#=

#plot will take a long time to load
cali_shape = JSON.parsefile("data/california-counties.json") #load json california shape
# you will generally need a JSON file if you want to plot other countries / regions etc.
VV = VegaDatasets.VegaJSONDataset(cali_shape,"data/california-counties.json")

@vlplot(width=500, height=300) +
@vlplot(
    mark={
        :geoshape,
        fill=:black, #black background
        stroke=:white
    },
    data={
        values=VV,
        format={
            type=:topojson,
            feature=:cb_2015_california_county_20m
        }
    },
    projection={type=:albersUsa},
)+ # from here on, we plot the houses as a scatter layer on top of the map
@vlplot(
    :circle,
    data=houses,
    projection={type=:albersUsa},
    longitude="longitude:q", #specific longitude
    latitude="latitude:q", #specific latitude
    size={value=12},
    color="median_house_value:q"
                    
)
# houses further inland are a little cheaper

=#

Note that the map cell above may take a few minutes to run!

One thing we will explore in this notebook is whether clustering the houses has any direct relationship with their prices. To do so, we will bucket the houses into price intervals of $50,000 and redo the color coding based on each bucket.
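
As a quick illustration of the bucketing step, integer division by 50,000 maps each price to a bucket index. The values below are a few `median_house_value` entries from the table shown earlier; `sample_prices` and `buckets` are names chosen just for this sketch.

```julia
# Sketch: bucket prices into $50,000-wide intervals with integer division.
# `sample_prices` holds a few median_house_value entries from the table above.
sample_prices = [452600.0, 358500.0, 352100.0, 241400.0, 226700.0]

# div(x, 50000) returns the truncated quotient (still a Float64 here),
# so Int.(...) converts the bucket indices to integers.
buckets = Int.(div.(sample_prices, 50000))

# Bucket b covers prices in the interval [50000*b, 50000*(b+1)).
println(buckets)
```

Houses with the same bucket index fall in the same $50,000 price band, which is what the nominal color encoding (`cprice:n`) relies on.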

In [31]:
#=
bucketprice = Int.(div.(houses[!,:median_house_value],50000)) #you can bucket houses based on 50k, giving us around 10 buckets
insertcols!(houses,3,:cprice=>bucketprice) #we insert the bucket prices in here

@vlplot(width=500, height=300) +
@vlplot(
    mark={
        :geoshape,
        fill=:black,
        stroke=:white
    },
    data={
        values=VV,
        format={
            type=:topojson,
            feature=:cb_2015_california_county_20m
        }
    },
    projection={type=:albersUsa},
)+
@vlplot(
    :circle,
    data=houses,
    projection={type=:albersUsa},
    longitude="longitude:q",
    latitude="latitude:q",
    size={value=12},
    color="cprice:n" #we are using the bucket prices here
                    
)
=#

### 🟤K-means clustering

In [9]:
X = houses[!, [:latitude, :longitude]] # select latitude and longitude as a DataFrame
# kmeans expects one observation per column, so we convert to a Matrix and transpose it
C = kmeans(Matrix(X)', 10) # cluster the houses into 10 groups
insertcols!(houses, 3, :cluster10 => C.assignments) # add the cluster labels as a new third column

Row,longitude,latitude,cluster10,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
,Float64,Float64,Int64,Float64,Float64,Float64?,Float64,Float64,Float64,Float64,String15
1,-122.23,37.88,7,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
2,-122.22,37.86,7,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
3,-122.24,37.85,7,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
4,-122.25,37.85,7,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
5,-122.25,37.85,7,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY
6,-122.25,37.85,7,52.0,919.0,213.0,413.0,193.0,4.0368,269700.0,NEAR BAY
7,-122.25,37.84,7,52.0,2535.0,489.0,1094.0,514.0,3.6591,299200.0,NEAR BAY
8,-122.25,37.84,7,52.0,3104.0,687.0,1157.0,647.0,3.12,241400.0,NEAR BAY
9,-122.26,37.84,7,42.0,2555.0,665.0,1206.0,595.0,2.0804,226700.0,NEAR BAY
10,-122.25,37.84,7,52.0,3549.0,707.0,1551.0,714.0,3.6912,261100.0,NEAR BAY
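
To make the idea behind `kmeans` more concrete, here is a minimal sketch of Lloyd's algorithm, the classic assign-then-update loop. It is not Clustering.jl's implementation: `lloyd_kmeans` is a hypothetical helper, the starting centers are passed in by hand, the loop runs a fixed number of iterations instead of checking convergence, and the toy coordinates are made up.

```julia
using Statistics

# Sketch of Lloyd's algorithm. Like `kmeans`, it expects a d×n matrix
# (one column per point). `lloyd_kmeans` is a hypothetical helper.
function lloyd_kmeans(points::AbstractMatrix, centers::AbstractMatrix; iters::Int=20)
    centers = float.(collect(centers))      # work on a copy of the starting centers
    n, k = size(points, 2), size(centers, 2)
    assignments = zeros(Int, n)
    for _ in 1:iters
        # Assignment step: each point joins its nearest center.
        for i in 1:n
            dists = [sum(abs2, points[:, i] .- centers[:, c]) for c in 1:k]
            assignments[i] = argmin(dists)
        end
        # Update step: each center moves to the mean of its members.
        for c in 1:k
            members = findall(==(c), assignments)
            isempty(members) && continue    # leave an empty cluster's center in place
            centers[:, c] = vec(mean(points[:, members], dims=2))
        end
    end
    return assignments, centers
end

# Two well-separated toy groups (made-up coordinates, not from the housing data).
toy = [0.0 0.1 0.2 5.0 5.1 5.2;
       0.0 0.1 0.0 5.0 5.1 5.0]
assignments, centers = lloyd_kmeans(toy, toy[:, [1, 4]])  # start from points 1 and 4
println(assignments)
```

Because each column is one point, the same column-per-observation layout explains why the housing coordinates are transposed before being passed to `kmeans`.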


In [33]:
#=
@vlplot(width=500, height=300) +
@vlplot(
    mark={
        :geoshape,
        fill=:black,
        stroke=:white
    },
    data={
        values=VV,
        format={
            type=:topojson,
            feature=:cb_2015_california_county_20m
        }
    },
    projection={type=:albersUsa},
)+
@vlplot(
    :circle,
    data=houses,
    projection={type=:albersUsa},
    longitude="longitude:q",
    latitude="latitude:q",
    size={value=12},
    color="cluster10:n" #using k-means data
                    
)
=#

Yes, location affects the price of a house, but location here means things like proximity to water, proximity to downtown, proximity to a bus stop, and so on.

Let's see if this remains true for the other clustering methods.

### 🟤K-medoids clustering
For this type of clustering, we need to build a distance matrix. We will use the `Distances` package for this purpose and compute the pairwise Euclidean distances.
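
To show what a distance matrix looks like, the sketch below builds one by hand for three locations taken from the rows displayed earlier, using only base Julia rather than `Distances`. The names `pts` and `D_small` are made up for this example, and the 20,000-point memory estimate uses a round, hypothetical point count.

```julia
# Sketch: pairwise Euclidean distances by hand for three locations from the
# rows shown earlier. Each column is one point (latitude, longitude).
pts = [37.88 37.86 37.84;
       -122.23 -122.22 -122.26]

n = size(pts, 2)
D_small = [sqrt(sum(abs2, pts[:, i] .- pts[:, j])) for i in 1:n, j in 1:n]

# The matrix is n×n, symmetric, and has zeros on the diagonal.
println(size(D_small))

# A dense Float64 matrix needs 8 bytes per entry, so memory grows as n^2.
# For a hypothetical n = 20_000 points:
bytes_needed = 20_000^2 * 8
println(bytes_needed / 1e9, " GB")
```

The quadratic memory growth is worth keeping in mind before computing the full matrix for every house in the dataset.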

In [11]:
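xmatrix = Matrix(X)' # transpose so that each column is one house
D = pairwise(Euclidean(), xmatrix, xmatrix, dims=2) # n×n matrix of pairwise Euclidean distances (memory grows as n^2)

K = kmedoids(D, 10) # k-medoids with 10 clusters, computed from the distance matrix
insertcols!(houses, 3, :medoids_clusters => K.assignments) # add the labels as a new column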

Row,longitude,latitude,medoids_clusters,cluster10,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
,Float64,Float64,Int64,Int64,Float64,Float64,Float64?,Float64,Float64,Float64,Float64,String15
1,-122.23,37.88,5,7,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
2,-122.22,37.86,5,7,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
3,-122.24,37.85,5,7,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
4,-122.25,37.85,5,7,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
5,-122.25,37.85,5,7,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY
6,-122.25,37.85,5,7,52.0,919.0,213.0,413.0,193.0,4.0368,269700.0,NEAR BAY
7,-122.25,37.84,5,7,52.0,2535.0,489.0,1094.0,514.0,3.6591,299200.0,NEAR BAY
8,-122.25,37.84,5,7,52.0,3104.0,687.0,1157.0,647.0,3.12,241400.0,NEAR BAY
9,-122.26,37.84,5,7,42.0,2555.0,665.0,1206.0,595.0,2.0804,226700.0,NEAR BAY
10,-122.25,37.84,5,7,52.0,3549.0,707.0,1551.0,714.0,3.6912,261100.0,NEAR BAY
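
K-medoids uses actual data points (medoids) as cluster centers, which is why a distance matrix is all it needs. The sketch below, on made-up one-dimensional positions, shows how a medoid is picked from a distance matrix and how it differs from the mean; `medoid_index` is a hypothetical helper, not part of `Clustering`.

```julia
# Sketch: the medoid of a group is the member with the smallest total
# distance to all other members. `medoid_index` is a hypothetical helper.
medoid_index(D::AbstractMatrix) = argmin(vec(sum(D, dims=2)))

# Toy 1-D positions (made up): four points close together and one outlier.
xs = [1.0, 2.0, 3.0, 4.0, 100.0]
D_toy = [abs(a - b) for a in xs, b in xs]   # pairwise distance matrix

m = medoid_index(D_toy)
println("medoid: ", xs[m], "  mean: ", sum(xs) / length(xs))
```

The medoid stays inside the dense group, while the mean is pulled toward the outlier.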


In [35]:
#=
@vlplot(width=500, height=300) +
@vlplot(
    mark={
        :geoshape,
        fill=:black,
        stroke=:white
    },
    data={
        values=VV,
        format={
            type=:topojson,
            feature=:cb_2015_california_county_20m
        }
    },
    projection={type=:albersUsa},
)+
@vlplot(
    :circle,
    data=houses,
    projection={type=:albersUsa},
    longitude="longitude:q",
    latitude="latitude:q",
    size={value=12},
    color="medoids_clusters:n"
                    
)
=#

### 🟤Hierarchical Clustering

In [13]:
K = hclust(D) # build the hierarchical clustering tree from the distance matrix
L = cutree(K; k=10) # cut the tree into 10 clusters
insertcols!(houses, 3, :hclust_clusters => L) # add the hclust labels as a new column (modifies houses in place)

Row,longitude,latitude,hclust_clusters,medoids_clusters,cluster10,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
,Float64,Float64,Int64,Int64,Int64,Float64,Float64,Float64?,Float64,Float64,Float64,Float64,String15
1,-122.23,37.88,1,5,7,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
2,-122.22,37.86,1,5,7,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
3,-122.24,37.85,1,5,7,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
4,-122.25,37.85,1,5,7,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
5,-122.25,37.85,1,5,7,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY
6,-122.25,37.85,1,5,7,52.0,919.0,213.0,413.0,193.0,4.0368,269700.0,NEAR BAY
7,-122.25,37.84,1,5,7,52.0,2535.0,489.0,1094.0,514.0,3.6591,299200.0,NEAR BAY
8,-122.25,37.84,1,5,7,52.0,3104.0,687.0,1157.0,647.0,3.12,241400.0,NEAR BAY
9,-122.26,37.84,1,5,7,42.0,2555.0,665.0,1206.0,595.0,2.0804,226700.0,NEAR BAY
10,-122.25,37.84,1,5,7,52.0,3549.0,707.0,1551.0,714.0,3.6912,261100.0,NEAR BAY
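
The build-then-cut idea can be sketched in a few lines of base Julia. The version below is a naive single-linkage agglomeration that repeatedly merges the two closest groups until `k` remain. It is an illustration under stated assumptions, not how `hclust` is implemented: `single_linkage` is a hypothetical helper, the toy positions are made up, and whether `hclust` uses single linkage by default is worth confirming against the installed Clustering.jl version.

```julia
# Sketch: naive single-linkage agglomerative clustering on a distance matrix.
# `single_linkage` is a hypothetical helper; it stops once k groups remain.
function single_linkage(D::AbstractMatrix, k::Int)
    n = size(D, 1)
    clusters = [[i] for i in 1:n]            # start with every point on its own
    while length(clusters) > k
        best = (Inf, 0, 0)                   # (distance, group a, group b)
        for a in 1:length(clusters)-1, b in a+1:length(clusters)
            # Single linkage: distance between the closest pair of members.
            d = minimum(D[i, j] for i in clusters[a], j in clusters[b])
            if d < best[1]
                best = (d, a, b)
            end
        end
        _, a, b = best
        append!(clusters[a], clusters[b])    # merge group b into group a
        deleteat!(clusters, b)
    end
    labels = zeros(Int, n)
    for (c, members) in enumerate(clusters), i in members
        labels[i] = c
    end
    return labels
end

# Toy 1-D positions (made up): two tight pairs and one far-away point.
xs = [0.0, 1.0, 10.0, 11.0, 30.0]
D_toy = [abs(a - b) for a in xs, b in xs]
labels = single_linkage(D_toy, 3)
println(labels)
```

Stopping at a chosen number of groups plays the same role here as cutting the full tree with `cutree(K; k=10)`.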


In [37]:
#=
@vlplot(width=500, height=300) +
@vlplot(
    mark={
        :geoshape,
        fill=:black,
        stroke=:white
    },
    data={
        values=VV,
        format={
            type=:topojson,
            feature=:cb_2015_california_county_20m
        }
    },
    projection={type=:albersUsa},
)+
@vlplot(
    :circle,
    data=houses,
    projection={type=:albersUsa},
    longitude="longitude:q",
    latitude="latitude:q",
    size={value=12},
    color="hclust_clusters:n"
                    
)
=#

### 🟤DBSCAN

In [21]:
?dbscan

search: dbscan bswap isnan



```
dbscan(points::AbstractMatrix, radius::Real;
       [metric=Euclidean()],
       [min_neighbors=1], [min_cluster_size=1],
       [nntree_kwargs...]) -> DbscanResult
```

Cluster `points` using the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm.

## Arguments

  * `points`: when `metric` is specified, the *d×n* matrix, where each column is a *d*-dimensional coordinate of a point; when `metric=nothing`, the *n×n* matrix of pairwise distances between the points
  * `radius::Real`: neighborhood radius; points within this distance are considered neighbors

Optional keyword arguments to control the algorithm:

  * `metric` (defaults to `Euclidean()`): the points distance metric to use, `nothing` means `points` is the *n×n* precalculated distance matrix
  * `min_neighbors::Integer` (defaults to 1): the minimal number of neighbors required to assign a point to a cluster "core"
  * `min_cluster_size::Integer` (defaults to 1): the minimal number of points in a cluster; cluster candidates with fewer points are discarded
  * `nntree_kwargs...`: parameters (like `leafsize`) for the `KDTree` constructor

## Example

```julia
points = randn(3, 10000)
# DBSCAN clustering, clusters with less than 20 points will be discarded:
clustering = dbscan(points, 0.05, min_neighbors = 3, min_cluster_size = 20)
```

## References:

  * Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu, *"A density-based algorithm for discovering clusters in large spatial databases with noise"*, KDD-1996, pp. 226–231.
  * Erich Schubert, Jörg Sander, Martin Ester, Hans Peter Kriegel, and Xiaowei Xu, *"DBSCAN Revisited, Revisited: Why and How You Should (Still) Use DBSCAN"*, ACM Transactions on Database Systems, Vol.42(3)3, pp. 1–21, https://doi.org/10.1145/3068335


In [23]:
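using Distances
dclara = pairwise(SqEuclidean(), Matrix(X)', dims=2) # squared Euclidean distance matrix
# keyword form from the help text above: metric=nothing marks dclara as a precomputed distance matrix
# the 0.05 radius applies to squared distances here, i.e. sqrt(0.05) ≈ 0.22 in plain Euclidean terms
L = dbscan(dclara, 0.05; metric=nothing, min_neighbors=10)
@show length(unique(L.assignments)) # number of distinct labels (label 0, if present, marks unassigned points)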

length(unique(L.assignments)) = 15


15
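
The neighborhood test at the heart of DBSCAN can be sketched on a handful of made-up points. The example mirrors the squared-distance setup above, so the threshold is compared against squared distances. It only illustrates the neighbor counting: the minimum of 3 is a hypothetical value, and whether a point counts itself toward the minimum is a library detail this sketch does not try to replicate.

```julia
# Sketch of the neighborhood test behind DBSCAN, on made-up 2-D points:
# four points close together and one isolated point.
toy = [0.0 0.1 0.1 0.0 3.0;
       0.0 0.0 0.1 0.1 3.0]
n = size(toy, 2)

# Squared Euclidean distances, mirroring the SqEuclidean matrix above.
sq = [sum(abs2, toy[:, i] .- toy[:, j]) for i in 1:n, j in 1:n]

threshold = 0.05                       # applied to squared distances
println(sqrt(threshold))               # equivalent radius in plain Euclidean units

# Count, for each point, how many points (itself included) fall within the threshold.
neighbor_counts = [count(<=(threshold), sq[i, :]) for i in 1:n]
println(neighbor_counts)

# With a hypothetical minimum of 3 neighbors, the dense points qualify as "core";
# the isolated fifth point does not.
is_core = neighbor_counts .>= 3
println(is_core)
```

Unlike the previous methods, the number of clusters is not chosen up front; it falls out of the radius and the neighbor minimum.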

In [25]:
insertcols!(houses,3,:dbscanclusters3=>L.assignments)

Row,longitude,latitude,dbscanclusters3,hclust_clusters,medoids_clusters,cluster10,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
,Float64,Float64,Int64,Int64,Int64,Int64,Float64,Float64,Float64?,Float64,Float64,Float64,Float64,String15
1,-122.23,37.88,1,1,5,7,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
2,-122.22,37.86,1,1,5,7,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
3,-122.24,37.85,1,1,5,7,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
4,-122.25,37.85,1,1,5,7,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
5,-122.25,37.85,1,1,5,7,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY
6,-122.25,37.85,1,1,5,7,52.0,919.0,213.0,413.0,193.0,4.0368,269700.0,NEAR BAY
7,-122.25,37.84,1,1,5,7,52.0,2535.0,489.0,1094.0,514.0,3.6591,299200.0,NEAR BAY
8,-122.25,37.84,1,1,5,7,52.0,3104.0,687.0,1157.0,647.0,3.12,241400.0,NEAR BAY
9,-122.26,37.84,1,1,5,7,42.0,2555.0,665.0,1206.0,595.0,2.0804,226700.0,NEAR BAY
10,-122.25,37.84,1,1,5,7,52.0,3549.0,707.0,1551.0,714.0,3.6912,261100.0,NEAR BAY


In [39]:
#=
@vlplot(width=500, height=300) +
@vlplot(
    mark={
        :geoshape,
    
        fill=:black,
        stroke=:white
    },
    data={
        values=VV,
        format={
            type=:topojson,
            feature=:cb_2015_california_county_20m
        }
    },
    projection={type=:albersUsa},
)+
@vlplot(
    :circle,
    data=houses,
    projection={type=:albersUsa},
    longitude="longitude:q",
    latitude="latitude:q",
    size={value=12},
    color="dbscanclusters3:n"
                    
)
=#

# Finally...
After finishing this notebook, you should be able to:
- [ ] run k-means clustering on your data
- [ ] run k-medoids clustering on your data
- [ ] run hierarchical clustering on your data
- [ ] run DBSCAN clustering on your data
- [ ] modify a dataframe and add a new named column
- [ ] generate good-looking plots of maps using the VegaLite package

# 🥳 One cool finding

Prices in California do not seem to map exactly onto geographical location. Specifically, running a clustering algorithm on the houses dataset did not reveal a mapping to the price ranges. This indicates that the relationship between price and geographical location is not necessarily based on neighborhood, but probably on other factors such as closeness to the water or to a downtown.
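
One way to put a number on this finding, rather than judging it from the maps alone, is to compare prices across cluster labels. The sketch below uses made-up labels and prices purely for illustration; with the notebook's data you could substitute `houses.cluster10` and `houses.median_house_value`. The variance-share measure is an assumption of this sketch, not something the notebook computes.

```julia
using Statistics

# Sketch: compare prices across cluster labels. The numbers are made up for
# illustration; in the notebook you would use houses.cluster10 and
# houses.median_house_value instead.
labels = [1, 1, 1, 2, 2, 2, 3, 3, 3]
prices = [100.0, 300.0, 500.0, 120.0, 310.0, 480.0, 90.0, 290.0, 510.0]

# Median price per cluster.
cluster_ids = sort(unique(labels))
medians = [median(prices[labels .== c]) for c in cluster_ids]
println(medians)

# Share of price variance explained by cluster membership:
# between-cluster sum of squares divided by total sum of squares.
grand = mean(prices)
ss_total = sum(abs2, prices .- grand)
ss_between = sum(count(==(c), labels) * (mean(prices[labels .== c]) - grand)^2
                 for c in cluster_ids)
explained = ss_between / ss_total
println(round(explained, digits=3))   # near 0 means the clusters say little about price
```

A share close to 0 would support the finding above, while a share close to 1 would mean the clusters separate cheap and expensive houses well.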