## PCA stands for principal component analysis
- It derives an orthogonal projection that converts a set of observations into linearly uncorrelated variables.
- Each of these new variables is called a principal component.
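
To make the two points above concrete, here is a minimal sketch of PCA written with only the Julia standard library. It is not how `MultivariateStats` is implemented; the data values are made up, and the variable names (`Xc`, `P`, `Y`, `Cy`) are chosen just for this illustration.

```julia
using LinearAlgebra, Statistics

# Toy data (made-up values): 2 variables × 5 observations, one observation per column
X = [2.0 4.0 6.0 8.0 10.0;
     1.0 3.0 2.0 5.0 4.0]

μ  = mean(X, dims=2)                  # mean of each variable
Xc = X .- μ                           # centered data
C  = Xc * Xc' / (size(X, 2) - 1)      # sample covariance matrix

E     = eigen(Symmetric(C))           # eigenvalues come back in ascending order
order = sortperm(E.values, rev=true)  # reorder so the largest variance is first
P     = E.vectors[:, order]           # columns are the orthogonal principal directions
Y     = P' * Xc                       # principal components (scores), one column per observation

# Covariance of the scores: diagonal up to rounding, so the new variables are uncorrelated
Cy = Y * Y' / (size(Y, 2) - 1)
```

The off-diagonal entries of `Cy` are zero up to floating-point error, and its diagonal holds the variance captured by each principal component.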

## import required modules for this tutorial

In [1]:
import Plots
import MultivariateStats
import Clustering
import RDatasets
Plots.plotly()

┌ Info: For saving to png with the Plotly backend PlotlyBase has to be installed.
└ @ Plots C:\Users\bark4\.julia\packages\Plots\iYDwd\src\backends.jl:372


Plots.PlotlyBackend()

## load iris dataset

In [2]:
iris = RDatasets.dataset("datasets", "iris");

## use every other row (the odd rows) as the training set

In [3]:
# training set
Xtr        = Array(iris[1:2:end,1:4])';
Xtr_labels = Array(iris[1:2:end,5]);

In [4]:
Xtr

4×75 adjoint(::Matrix{Float64}) with eltype Float64:
 5.1  4.7  5.0  4.6  4.4  5.4  4.8  5.8  …  6.3  6.0  6.7  5.8  6.7  6.3  6.2
 3.5  3.2  3.6  3.4  2.9  3.7  3.0  4.0     3.4  3.0  3.1  2.7  3.3  2.5  3.4
 1.4  1.3  1.4  1.4  1.4  1.5  1.4  1.2     5.6  4.8  5.6  5.1  5.7  5.0  5.4
 0.2  0.2  0.2  0.3  0.2  0.2  0.1  0.2     2.4  1.8  2.4  1.9  2.5  1.9  2.3

## use the remaining (even) rows as the testing set

In [5]:
Xte = Array(iris[2:2:end,1:4])';
Xte_labels = Array(iris[2:2:end,5]);
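
The range indexing used for the split can be checked on a small stand-in. The sketch below uses a hypothetical vector of row numbers rather than the iris data, to show that `1:2:end` and `2:2:end` select disjoint halves that together cover every row.

```julia
rows  = collect(1:10)     # stand-in for row indices of a 10-row dataset
train = rows[1:2:end]     # odd positions:  1, 3, 5, 7, 9
test  = rows[2:2:end]     # even positions: 2, 4, 6, 8, 10

# No row lands in both sets, and together they cover the whole dataset
overlap = intersect(train, test)
covered = sort(vcat(train, test)) == rows
```

Because the iris rows are ordered by species, taking alternating rows also keeps the three species close to evenly represented in both halves.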

### Xtr and Xte are the training and testing data matrices,
### with each observation in a column
### train a PCA model, allowing up to 4 dimensions

In [19]:
M = MultivariateStats.fit(MultivariateStats.PCA, Xtr; maxoutdim=4)

PCA(indim = 4, outdim = 3, principalratio = 0.9957325846529409)

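### Here, the maximum output dimension is set to 4, but the fitted model keeps only 3 components: by default, `fit` retains just enough components to preserve 99% of the variance (the `pratio` keyword), and 3 components already explain about 99.6%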
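
The selection rule can be sketched in a few lines. This is an illustration of the idea, not the library's code: `choose_outdim` is a hypothetical helper, and the variances are made-up numbers chosen so that three components pass the 99% threshold.

```julia
# Hypothetical helper: smallest number of components whose cumulative share of
# the total variance reaches `pratio`, capped at `maxoutdim`.
function choose_outdim(vars::Vector{Float64}; maxoutdim::Int=length(vars), pratio::Float64=0.99)
    sorted = sort(vars, rev=true)              # largest variance first
    shares = cumsum(sorted) ./ sum(sorted)     # cumulative fraction of variance
    k = findfirst(>=(pratio), shares)
    k === nothing && (k = length(sorted))      # guard against rounding at pratio = 1.0
    return min(k, maxoutdim)
end

vars = [4.2, 0.24, 0.08, 0.02]                 # made-up component variances, for illustration only

k_default = choose_outdim(vars; maxoutdim=4)              # threshold 0.99
k_all     = choose_outdim(vars; maxoutdim=4, pratio=1.0)  # keep everything
```

With these numbers the cumulative shares are roughly 0.925, 0.978, 0.996 and 1.0, so the default threshold stops at three components. In the notebook, passing `pratio=1.0` to `fit` should likewise keep all four, assuming the keyword behaves as described above.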

## apply PCA model to testing set

In [17]:
Yte = MultivariateStats.transform(M, Xte)

3×75 Matrix{Float64}:
  2.72714    2.75491     2.32396   …  -1.92047   -1.74161   -1.37706
 -0.230916  -0.406149    0.646374      0.246554   0.127625  -0.280295
  0.253119   0.0271266  -0.230469     -0.180044  -0.123165  -0.314992

## reconstruct testing observations (approximately)

In [18]:
Xr = MultivariateStats.reconstruct(M, Yte)
using Statistics
r2 = sum((Xte .- Xr).^2)  # sum of squared reconstruction errors (not the MSE; mean((Xte .- Xr).^2) would give that)

2.193311349382029

In [12]:
sqrt(r2)  # square root of the summed squared error, i.e. the Frobenius norm of Xte - Xr

1.4809832373737486
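
The difference between the summed squared error computed above and a mean squared error is easy to see on a toy case. The matrices below are made-up values, not the iris reconstruction, and the names `sse`, `mse`, and `rmse` are introduced only for this sketch.

```julia
using Statistics

X  = [1.0 2.0; 3.0 4.0]      # made-up "true" values
Xr = [1.1 1.9; 3.2 3.8]      # made-up "reconstructed" values

sse  = sum((X .- Xr).^2)     # sum of squared errors: grows with the number of entries
mse  = mean((X .- Xr).^2)    # SSE divided by the number of entries
rmse = sqrt(mse)             # error on the same scale as the data
```

Here `sse` is about 0.10 and `mse` about 0.025. The mean-based measures are easier to compare across datasets of different sizes.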

## group results by testing set labels for color coding

In [20]:
setosa     = Yte[:,Xte_labels.=="setosa"]
versicolor = Yte[:,Xte_labels.=="versicolor"]
virginica  = Yte[:,Xte_labels.=="virginica"]

3×25 Matrix{Float64}:
 -1.4126    -1.95359   -3.35517   …  -1.92047   -1.74161   -1.37706
 -0.556727  -0.133821   0.692925      0.246554   0.127625  -0.280295
 -0.214115  -0.075898   0.293002     -0.180044  -0.123165  -0.314992

In [21]:
Xte_labels.=="versicolor"

75-element BitVector:
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 ⋮
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
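
The grouping step relies on indexing columns with a boolean mask. A small sketch with made-up scores and hypothetical labels (`"a"`, `"b"`) shows the mechanism, plus one way to build all groups at once instead of one line per class.

```julia
Y      = [1.0 2.0 3.0 4.0;
          5.0 6.0 7.0 8.0]          # made-up 2×4 matrix, one observation per column
labels = ["a", "b", "a", "b"]       # one label per column

mask = labels .== "a"               # element-wise comparison gives a BitVector
Ya   = Y[:, mask]                   # keeps only the columns where mask is true

# All groups in one pass: label => matrix of that label's columns
groups = Dict(l => Y[:, labels .== l] for l in unique(labels))
```

`Ya` contains columns 1 and 3 of `Y`, and `groups["b"]` contains columns 2 and 4.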

## visualize first 3 principal components in 3D interactive plot

In [22]:
# one scatter series per species, plotted on the first three principal components
p = Plots.scatter(setosa[1,:],setosa[2,:],setosa[3,:],marker=:circle,linewidth=0,label="setosa")
Plots.scatter!(versicolor[1,:],versicolor[2,:],versicolor[3,:],marker=:circle,linewidth=0,label="versicolor")
Plots.scatter!(virginica[1,:],virginica[2,:],virginica[3,:],marker=:circle,linewidth=0,label="virginica")
Plots.plot!(p,xlabel="PC1",ylabel="PC2",zlabel="PC3")