# PCA for dimensionality reduction

Let's reduce the dimensionality of the price/area data from the houses dataset.

We can start by loading this dataset as a table, grabbing the columns for the square footage and prices of the houses, and storing them in an `Array`, `F`. To work properly with the methods we'll be using, `F` should contain floating-point values (`Float64`s) and should have one row for every feature/dimension and one column for every data point/sample.

As before, we'll load `JuliaDB` and filter out the data points for which no square footage was recorded.

In [None]:
using JuliaDB
using Plots; gr()

In [None]:
houses = loadtable("houses.csv")
filtered_houses = filter(x -> x > 0, houses; select = :sq__ft);

From `filtered_houses`, we'll use `select` to grab the columns of interest.

In [None]:
F = select(filtered_houses, (:sq__ft, :price))

To process the data in table `F`, we want it stored in an `Array`. We'll convert `F` to an `Array` with the following bit of code:

In [None]:
Farray = hcat(columns(F)...)'

What did we just do here?

Let's take a moment to look at each of the functions composed in the above command, one at a time.

First, we have the `columns` command. This returns a `Tuple` storing an `Array` for each column in the input data structure.

In [None]:
columns(F)

In [None]:
columns(F)[1]

The `hcat` command concatenates its arguments horizontally. Given a set of input `Vector`s (1D `Array`s) of the same length, it builds a 2D `Array` with one column per input. For example,

In [None]:
hcat([1, 2, 3], [4, 5, 6])

We want to use `hcat` to construct a 2D `Array` (or `Matrix`) from the `Vector`s that we get from `columns`.

However, if we try to run `hcat` directly on the output of `columns`, it won't do what we expect.

In [None]:
hcat(columns(F))

What we want is a 2D `Array` with 814 rows, but we've accidentally created a 1x1 `Array`.

The issue is that we need to pass **the components** of the output of `columns` as separate arguments, **not the output** of `columns` as a single argument. We can think of the "splat" operator, `...`, as unpacking the collection it follows into individual arguments.

Compare the two calls to `hcat`:

In [None]:
hcat([[1, 2], [3, 4]])

In [None]:
hcat([[1, 2], [3, 4]]...)
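
The same comparison can be made on the shapes of the results. As a small sketch (the tuple `cols` below is a made-up stand-in for the output of `columns(F)`, not real data), we can check the size we get with and without the splat:

```julia
# Made-up stand-in for `columns(F)`: a tuple holding two column vectors.
cols = ([1, 2, 3], [4, 5, 6])

without_splat = hcat(cols)     # hcat receives ONE argument: the tuple itself
with_splat = hcat(cols...)     # hcat receives TWO arguments: the two vectors

println(size(without_splat))   # (1, 1)
println(size(with_splat))      # (3, 2)
```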

Let's store the `Array` version of `F` in `F` itself:

In [None]:
F = hcat(columns(F)...)'
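
The trailing `'` matters here: `hcat` alone gives one row per house, while we want one row per feature and one column per house. A minimal sketch with three made-up houses (the values and the names `sq_ft` and `price` are invented for illustration) shows the change in layout:

```julia
# Three made-up houses; these numbers are not from the dataset.
sq_ft = [1000, 1500, 2000]
price = [200_000, 310_000, 405_000]

samples_as_rows = hcat(sq_ft, price)    # 3×2: one row per house
features_as_rows = samples_as_rows'     # 2×3: one row per feature

println(size(samples_as_rows))              # (3, 2)
println(size(features_as_rows))             # (2, 3)
println(features_as_rows[1, :] == sq_ft)    # true
```

For real-valued data, `'` only swaps rows and columns. It returns a lazy wrapper around the original data rather than a new `Array`.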

Recall how the data looks when we plot housing prices against square footage.

In [None]:
scatter(F[1, :], F[2, :], legend = false)
xlabel!("Square footage")
ylabel!("Housing prices")
title!("Housing prices vs. square footage")

We can use the `MultivariateStats` package to run PCA.

In [None]:
# using Pkg; Pkg.add("MultivariateStats")
using MultivariateStats

Next we'll use `fit` to fit the model, but `fit` won't work on an `Array` of `Int`s like `F`. Let's convert `F` to an `Array` of `Float64`s.

In [None]:
F = convert(Array{Float64}, F)
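
As a quick check of what this conversion does, here is a sketch on a tiny made-up integer matrix (`A` is not part of the dataset). Broadcasting `Float64` over the array is shown as an equivalent alternative:

```julia
A = [1 2 3; 4 5 6]'                 # 3×2 lazy adjoint of an Int matrix (made-up values)
Af = convert(Array{Float64}, A)     # materialized as a plain Matrix{Float64}
Ab = Float64.(A)                    # broadcasting alternative

println(Af isa Matrix{Float64})     # true
println(Af == Ab)                   # true
```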

Now use `fit` to fit the model.

In [None]:
M = fit(PCA, F)

Note that you can cap the dimension of the new space by setting `maxoutdim`, and you can change the method to, for example, `:svd` with the following syntax:

```julia
fit(PCA, F; maxoutdim = 1, method = :svd)
```

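It seems like we only get one dimension with PCA! By default, `fit` keeps only as many principal components as are needed to preserve 99% of the variance in the data (the `pratio` keyword), and here a single component is enough. Let's use `transform` to map all of our 2D data in `F` to 1D data with our model, `M`.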

In [None]:
y = transform(M, F)
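
How many dimensions the output has depends on how the variance is spread across the components. The sketch below illustrates that kind of selection rule, assuming a threshold on the cumulative share of variance such as the 99% that `MultivariateStats` documents as its default `pratio`. The helper `components_needed` and the variance values are hypothetical, written only for illustration:

```julia
# Hypothetical helper: smallest number of components whose cumulative
# share of the total variance reaches `ratio`.
function components_needed(variances, ratio)
    sorted = sort(variances, rev = true)         # largest variance first
    shares = cumsum(sorted) ./ sum(sorted)       # cumulative share of the total
    return findfirst(s -> s >= ratio, shares)
end

# Made-up variances: one dominant direction, so one component is enough.
println(components_needed([12.42, 0.005], 0.99))    # 1

# Made-up variances spread more evenly: all three components are needed.
println(components_needed([5.0, 4.0, 1.0], 0.95))   # 3
```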

Let's use `reconstruct` to map our now 1D data, `y`, back into 2D (`Xr`), so that we can easily overlay it on our 2D data in `F` along the principal direction/component.

In [None]:
Xr = reconstruct(M, y)
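
To make the fit, transform, and reconstruct steps less of a black box, here is a minimal sketch of the same three ideas using only Julia's standard libraries. It is not the `MultivariateStats` implementation: it assumes the textbook approach of taking the top eigenvector of the sample covariance matrix, and the data `X` and the names `mu`, `Z`, `C`, and `P` are made up for illustration.

```julia
using LinearAlgebra, Statistics

# Made-up 2×5 data: one row per feature, one column per sample.
X = [1.0 2.0 3.0 4.0 5.0;
     2.1 3.9 6.2 7.8 10.1]

# "fit": center the data and find the direction of largest variance.
mu = mean(X, dims = 2)            # 2×1 matrix of per-feature means
Z = X .- mu                       # centered data
C = Z * Z' / (size(X, 2) - 1)     # 2×2 sample covariance matrix
E = eigen(Symmetric(C))           # eigenvalues come back in ascending order
P = E.vectors[:, end:end]         # 2×1 matrix: first principal direction

# "transform": coordinate of each sample along that direction.
y = P' * Z                        # 1×5

# "reconstruct": map the 1D coordinates back into the original 2D space.
Xr = P * y .+ mu                  # 2×5, every column lies on a single line

println(size(y))                  # (1, 5)
println(size(Xr))                 # (2, 5)
```

Because the made-up points lie almost on a line, `Xr` stays close to `X`. The leftover distance corresponds to the variance carried by the discarded second component.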

And now we create that overlay, where we can see the points along the principal component in red.

(Each blue point maps uniquely to some red point!)

In [None]:
scatter(F[1, :], F[2, :], label = "Original data")
scatter!(Xr[1, :], Xr[2, :], label = "PCA data")
xlabel!("Square footage")
ylabel!("Housing prices")
title!("Housing data overlaid with reconstructed data from PCA")