# Kmeans Clustering

In this notebook, we'll practice kmeans clustering on the housing data.

In addition to `JuliaDB` and `Plots`, we will load the `Clustering` package to do this.

In [None]:
using JuliaDB, Plots; gr()

In [None]:
using Clustering

For this exercise, we're interested in seeing where the houses in our dataset fall on a spatial map. So, we'll grab their latitudes and longitudes with `select` and the column symbols, `:latitude` and `:longitude`.

In [None]:
houses = loadtable("houses.csv")
locations = select(houses, (:latitude, :longitude))

As in the last demo for PCA, we want our data in an `Array` to be compatible with our kmeans implementation. We'll convert `locations` to an `Array` with the following code (see notebook 4 if you forgot what this means):

In [None]:
locations = hcat(columns(locations)...)

At this point, each data point is stored as a row of `locations`. The `kmeans` function from `Clustering` expects each sample to be a column, so we transpose `locations`.

In [None]:
locations = locations'
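To make the two layouts concrete, here is a small sketch that uses made-up coordinates instead of the housing table. The names `lat`, `lon`, `toy_columns`, `by_row`, and `by_column` are hypothetical, and the named tuple only stands in for what `columns(locations)` provides.

```julia
# Made-up coordinates standing in for the real latitude/longitude columns
lat = [47.5, 47.6, 47.7]
lon = [-122.3, -122.2, -122.1]

# A named tuple of columns, used here as a stand-in for columns(locations)
toy_columns = (latitude = lat, longitude = lon)

# Splatting places each column side by side: one row per house
by_row = hcat(toy_columns...)
size(by_row)       # (3, 2)

# Transposing gives one column per house
by_column = by_row'
size(by_column)    # (2, 3)
```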

As a first pass at guessing how many clusters we might need, let's use the number of zip codes in our data.

(Try changing this to see how it impacts results!)

In [None]:
k = length(unique(select(houses, :zip)))

We can use the `kmeans` function to do kmeans clustering!

In [None]:
C = kmeans(locations, k)
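To follow up on the suggestion to try different values of `k`, one option is to compare the total cost that `kmeans` reports (the `totalcost` field of its result) for several cluster counts. The sketch below uses synthetic points rather than the housing data, and the names `X`, `ks`, and `costs` are hypothetical, chosen so they don't overwrite `locations` or `k`.

```julia
using Clustering, Random

Random.seed!(42)      # seed the global RNG so repeated runs are comparable
X = rand(2, 200)      # synthetic data: 200 points, one per column

ks = (2, 5, 10, 20)   # candidate numbers of clusters (arbitrary choices)
costs = [kmeans(X, k_try).totalcost for k_try in ks]

for (k_try, cost) in zip(ks, costs)
    println("k = ", k_try, "  total cost = ", cost)
end
```

The total cost typically shrinks as the number of clusters grows, so it is more useful for spotting where the improvement levels off than for picking the smallest value.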

Now let's create a new table, `clustered_houses`, with all the same data as `houses` that also includes a column for the cluster to which each house has been assigned.

Our output from `kmeans()` has an `assignments` field that stores the cluster assignments for each data point.

In [None]:
C.assignments
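Besides `assignments`, the result object carries a few other fields that are handy for a quick sanity check. The sketch below runs `kmeans` on two synthetic, well-separated blobs (not the housing data) and looks at `centers`, `counts`, and `converged`. The names `X` and `R` are hypothetical.

```julia
using Clustering, Random

Random.seed!(1)   # seed the global RNG so repeated runs are comparable

# Two synthetic blobs of 50 points each, one point per column
X = hcat(randn(2, 50), randn(2, 50) .+ 10)
R = kmeans(X, 2)

R.centers       # 2×2 matrix: one column per cluster center
R.counts        # number of points assigned to each cluster
R.converged     # whether the algorithm converged within its iteration limit
length(R.assignments)   # one assignment per data point
```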

To create a new table from `houses` with a column for these cluster assignments, use `setcol`. Here `setcol` takes as inputs

* the name of the table to add to
* the name of the new column to add
* the data for the new column to add

In [None]:
clustered_houses = setcol(houses, :cluster, C.assignments)

Let's plot each cluster as a different color.

In [None]:
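clusters_figure = plot()
for i = 1:k
    # Keep only the houses assigned to cluster i
    houses_in_cluster_i = filter(x -> x == i, clustered_houses, select = :cluster)
    xvals = select(houses_in_cluster_i, :latitude)
    yvals = select(houses_in_cluster_i, :longitude)
    # Each scatter! call adds a new series, which gets its own color
    scatter!(clusters_figure, xvals, yvals, markersize=4)
end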
xlabel!("Latitude")
ylabel!("Longitude")
title!("Houses color-coded by cluster")
display(clusters_figure)

And now let's try coloring them by zip code.

In [None]:
unique_zips = unique(select(houses, :zip))
zips_figure = plot()
for uzip in unique_zips
    # Keep only the houses in this zip code
    subs = filter(x -> x == uzip, houses, select = :zip)
    x = select(subs, :latitude)
    y = select(subs, :longitude)
    scatter!(zips_figure, x, y)
end
xlabel!("Latitude")
ylabel!("Longitude")
title!("Houses color-coded by zip code")
display(zips_figure)

Let's see the two plots side by side.

In [None]:
plot(clusters_figure, zips_figure, layout=(1, 2), legend = false)

It's not an exact match, but there are some structural similarities! Now we know that ZIP codes are not randomly assigned. :)
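One way to put numbers on that visual impression is a contingency table that counts how many houses fall into each (cluster, zip code) pair. The sketch below uses made-up labels. In the notebook, `cluster_labels` would presumably be `C.assignments` and `zip_labels` would be `select(houses, :zip)`, but that substitution is an assumption. All names here are hypothetical.

```julia
# Made-up labels for seven houses
cluster_labels = [1, 1, 2, 2, 3, 3, 3]
zip_labels = [98001, 98001, 98002, 98002, 98002, 98003, 98003]

# Map each distinct zip code to a column index
zips = sort(unique(zip_labels))
zip_index = Dict(z => j for (j, z) in enumerate(zips))

# Rows are clusters, columns are zip codes
table = zeros(Int, maximum(cluster_labels), length(zips))
for (c, z) in zip(cluster_labels, zip_labels)
    table[c, zip_index[z]] += 1
end

table
```

If the clusters lined up perfectly with the zip codes, each row and each column would contain a single nonzero entry. Counts spread across a row show a cluster that spans several zip codes.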