## PCA exercise

In this notebook, you'll apply PCA to both a dataset of fruit images and a "mystery" dataset to see what you can learn about each.

One of the takeaways of this exercise and those that follow is that, with real data, you often have to try multiple methods/techniques to understand your data!

### Getting our datasets

First, let's load in our datasets. You can start with the code provided here to load in your data in each of the exercise notebooks.

For our fruit data, we will load pictures of apples and pictures of bananas with JuliaDB.

In [None]:
using JuliaDB, MNIST 

In [None]:
fruit_table = loadtable(["../training/data/Apple_Golden_1.dat", "../training/data/bananas.dat"];
                        delim = '\t',
                        # add an `apple` column that is true for rows read from the apple file
                        filenamecol = :apple => (x -> x == "../training/data/Apple_Golden_1.dat"))

And now let's convert the columns of this `JuliaDB` table to a `Matrix`.

In [None]:
matdata = hcat(columns(fruit_table)...)

In [None]:
using Statistics  # provides `mean` and `std` on Julia 1.x
rescale(A, dim::Integer=1) = (A .- mean(A, dims=dim)) ./ max.(std(A, dims=dim), eps())

In [None]:
fruit_data = rescale(matdata, 1)[:, 2:end]
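
To see what this rescaling step does, here is a minimal sketch on a made-up toy matrix (the values and the name `standardize` are hypothetical, and the code assumes Julia 1.x, where `mean` and `std` take a `dims` keyword). It standardizes each column to zero mean and unit standard deviation, and shows why the `max.(..., eps())` guard matters when a column is constant.

```julia
using Statistics

# Same idea as `rescale`, written with the `dims` keyword used by Julia 1.x
standardize(A; dims::Integer=1) = (A .- mean(A, dims=dims)) ./ max.(std(A, dims=dims), eps())

# Hypothetical toy matrix: 4 observations (rows) × 3 features (columns).
# The third column is constant, so its standard deviation is exactly zero.
A = [1.0 10.0 5.0;
     2.0 20.0 5.0;
     3.0 30.0 5.0;
     4.0 40.0 5.0]

Z = standardize(A)

col_means = vec(mean(Z, dims=1))   # close to 0 for every column
col_stds  = vec(std(Z, dims=1))    # close to 1 for varying columns, 0 for the constant one
```

Without the guard, the constant column would be divided by zero and fill with `NaN`; with it, the column simply becomes all zeros. This likely matters for image data, where some pixels may never change across observations.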

Use the first column of `fruit_table` to create an `Array` called `fruit_labels` that stores the string "Apple" for entries that are `true` and "Banana" for entries that are `false`.

#### Solution:
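
One possible approach, sketched on a hypothetical stand-in vector rather than the real table: after `hcat`, the Boolean `apple` flag should sit in the first column of `matdata` as 1.0 or 0.0, so with the real data you could try `apple_flags = matdata[:, 1]` (an assumption based on the cells above, not a confirmed solution).

```julia
# Hypothetical stand-in for the first column of `matdata`:
# after `hcat`, the Boolean `apple` flag is stored as 1.0 (apple) or 0.0 (banana).
apple_flags = [1.0, 1.0, 0.0, 1.0, 0.0]

# Map each flag to a string label
fruit_labels = [flag == 1 ? "Apple" : "Banana" for flag in apple_flags]
```

The comparison `flag == 1` also works if the flags are still `Bool` values, since `true == 1` in Julia.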

The variable `fruit_data` is bound to an array with columns for each of 5 values describing our fruit images, and `fruit_labels` is an array storing labels for the corresponding images, indicating whether that picture was of an apple or a banana. We will use these labels to visualize how ML techniques change and preserve our data.

Our mystery dataset is pulled in as follows:

In [None]:
mystery_data, labels = traindata()

In [None]:
N = 2500
mystery_data = rescale(convert(Matrix{Float64}, mystery_data[:, 1:N])',1)
println(size(mystery_data))

In [None]:
mystery_labels = Int.(labels)[1:N];

The variable `mystery_data` now holds 2500 observations with 784 dimensions each. The variable `mystery_labels` holds the true groupings in this dataset, and we will see whether we can find those groupings organically.

### Explore the dataset

Before we apply PCA or any other techniques to our fruit data set, let's try to get a feel for what our data set looks like.

We have five features in our data set. Can we use the raw data for these five features to distinguish apples from bananas?

Plot each pairwise combination of features from the dataset to see if we can tell the difference between apples and bananas. (For example, plot height vs. width.)

You may want to use the `group = fruit_labels` keyword argument to `scatter` from the `Plots` package to visualize apple vs. banana data points.

#### Solution:

You should see that we can use colors to distinguish apples from bananas, but let's use PCA to see if we can do any better!

### Apply Principal Components Analysis

Let's start by performing a Principal Components Analysis of our data. On our fruit data, use the function `fit` with the model `PCA` from the `MultivariateStats` package to create a PCA model.

How many output dimensions do you need to capture 90% of the variance in your dataset? 95%? 99%?

#### Solution:
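
As a hedged illustration of the idea, not the `MultivariateStats` implementation, the share of variance captured by each principal component can be computed by hand from the singular values of the centered data. The sketch below uses a synthetic stand-in for `fruit_data` (all values are made up) and only Julia 1.x standard libraries.

```julia
using LinearAlgebra, Statistics, Random

Random.seed!(1)

# Hypothetical stand-in for `fruit_data`: 200 observations (rows) × 5 features (columns),
# built from 2 latent factors plus a little noise so that a few components dominate.
n, d = 200, 5
latent = randn(n, 2)
X = latent * randn(2, d) .+ 0.05 .* randn(n, d)

# Center each column, then take the singular values of the centered matrix.
Xc = X .- mean(X, dims=1)
s = svdvals(Xc)

# Variance captured by each principal component, as a fraction of the total.
explained = s .^ 2 ./ sum(s .^ 2)
cumulative = cumsum(explained)

# Smallest number of components whose cumulative share reaches each threshold.
for threshold in (0.90, 0.95, 0.99)
    k = findfirst(c -> c >= threshold, cumulative)
    println(">= ", threshold * 100, "% of variance: ", k, " component(s)")
end
```

With `MultivariateStats`, the `pratio` keyword of `fit(PCA, ...)` and the `principalratio` function should give you the same kind of information. Note that the package treats each column as one observation, so a matrix with one row per image would need to be transposed first; check the documentation for the version you have installed.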

Let's keep as many dimensions as we need to capture at least 99% of the variance in our real dataset. This way we can drop a dimension without losing much explanatory power!

Transform `fruit_data` to `fruit_PCA` using the model you get from `fit`. Demonstrate that you're able to reconstruct `fruit_data` from `fruit_PCA`.

#### Solution
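
To make the transform-then-reconstruct round trip concrete, here is a minimal hand-rolled sketch using an SVD on synthetic data (a stand-in for `fruit_data`; the helper names `project`, `reconstruct_from`, and `rel_error` are hypothetical). It is meant to show the underlying linear algebra, not to replace the model you get from `fit`.

```julia
using LinearAlgebra, Statistics, Random

Random.seed!(2)

# Hypothetical stand-in for `fruit_data`: 100 observations (rows) × 5 features (columns).
X = randn(100, 3) * randn(3, 5) .+ 0.01 .* randn(100, 5)

col_means = mean(X, dims=1)      # 1 × 5 row of column means
F = svd(X .- col_means)          # principal directions are the columns of F.V

# Project onto the first k principal directions, then map back to feature space.
project(k) = (X .- col_means) * F.V[:, 1:k]
reconstruct_from(Y) = Y * F.V[:, 1:size(Y, 2)]' .+ col_means

# Relative reconstruction error when k components are kept.
rel_error(k) = norm(X - reconstruct_from(project(k))) / norm(X)

errors = [rel_error(k) for k in 1:5]
```

Keeping all five components reconstructs the data up to floating-point error, and the error can only grow as components are dropped. With a fitted `MultivariateStats` model, `reconstruct` plays the role of `reconstruct_from` here.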

Overlay plots of the original and reconstructed `fruit_data`, grouped by `fruit_labels`, for a pair of features, to see if your data has changed much during transformation and reconstruction.

For fun, see how this overlay changes as you decrease the number of output dimensions in your model.

#### Solution

Now that we have dimensionally reduced data, let's plot the principal components.

When we explored our untransformed dataset, we tried plotting each possible pairing of features to determine how we might distinguish apples from bananas. We shouldn't have to do this brute-force search of features now that we've done PCA.

Plot the principal component(s) most likely to show variation in the data.

#### Solution

### What about our mystery data?

Let's now do the same analysis on our mystery data.

Try this without specifying the number of output dimensions. How many do you get?

Then try with only two output dimensions. How high is your `principalratio`?

Are you able to learn much from PCA on this dataset?

#### Solution