# Dimensionality Reduction 2


## PCA

 
<span style="color: #cc1652; font-weight:700"> Video explanation of PCA: </span>https://www.youtube.com/watch?v=FgakZw6K1QQ or https://www.youtube.com/watch?v=g-Hb26agBFg

In [None]:
# Pull seed data down
# Info about the bdims dataset: https://www.openintro.org/stat/data/bdims.php

download.file("http://www.openintro.org/stat/data/bdims.RData", destfile = "bdims.RData")
load("bdims.RData")

names(bdims)

In [None]:
# Remove 'sex': it is a categorical variable, and PCA needs numeric inputs

# names(bdims) %in% c('sex') is TRUE for each column named 'sex'. Negating it with !
# gives a logical vector that is TRUE for every column we want to keep.
# The c() function combines its arguments into a vector.
!names(bdims) %in% c('sex')

# Subset the bdims df to the columns other than 'sex'. Name this subset lessData.
lessData <- bdims[, !names(bdims) %in% c('sex')]

head(lessData)

ncol(lessData)
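
The `%in%` subsetting idiom above can be tried on a small data frame first. This is a minimal sketch on made-up values; `toy`, `keep`, and `toy_less` are hypothetical names used only for illustration.

```r
# A tiny made-up data frame standing in for bdims (values are illustrative only)
toy <- data.frame(height = c(170, 165), weight = c(70, 60), sex = c(1, 0))

# TRUE for every column that is not named 'sex'
keep <- !names(toy) %in% c("sex")
keep

# Keep only the flagged columns
toy_less <- toy[, keep]
names(toy_less)

# setdiff() on the column names selects the same columns
identical(toy[, setdiff(names(toy), "sex")], toy_less)
```

If only one column would remain, `toy[, keep, drop = FALSE]` keeps the result a data frame rather than a plain vector.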

__References__:
 - http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/112-pca-principal-component-analysis-essentials/
 - http://www.sthda.com/english/wiki/principal-component-analysis-in-r-prcomp-vs-princomp-r-software-and-data-mining

In [None]:
cor(lessData)

In [None]:
library(corrplot)

corrplot(cor(lessData), order='hclust')

In [None]:
# Compute the Principal Components

pca <- princomp(lessData, cor=TRUE)

# princomp() returns a list (here assigned to the object "pca") whose elements
# include sdev, loadings, center, scale, and scores.

In [None]:
summary(pca) # print variance accounted for

### IMPORTANT PCA information

The __Proportion of Variance__ and __Cumulative Proportion__ rows show how much of the total variance each component explains, individually and cumulatively.

Note that the first principal component (PC1) captures __0.6248721__ of the total variance, or about 62.5%.

Looking at the __Cumulative Proportion__, we can see that PC1 through PC19 together capture 99% of the total variance.
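
Both rows reported by `summary()` can be recomputed from the component standard deviations. The following is a minimal sketch that assumes the built-in `USArrests` data as a stand-in for `lessData`. The names `demo_pca`, `prop_var`, `cum_prop`, and `n_keep`, and the 0.99 threshold, are illustrative choices and not part of the notebook.

```r
# Stand-in data: the built-in USArrests data set (an assumption for illustration;
# substitute lessData to work with the body-dimension data).
demo_pca <- princomp(USArrests, cor = TRUE)

# Each component's variance is its squared standard deviation.
pc_var <- demo_pca$sdev^2

# Proportion of Variance and Cumulative Proportion, as printed by summary().
prop_var <- pc_var / sum(pc_var)
cum_prop <- cumsum(prop_var)
round(rbind(prop_var, cum_prop), 4)

# Smallest number of components whose cumulative proportion reaches a threshold.
# The 0.99 threshold is an arbitrary choice for illustration.
n_keep <- which(cum_prop >= 0.99)[1]
n_keep
```

Because `cor = TRUE` runs the analysis on the correlation matrix, the component variances sum to the number of variables.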

In [None]:
loadings(pca) # pc loadings

# In this matrix of variable loadings, each column is an eigenvector of the
# correlation matrix (because cor=TRUE was used above).
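
The link between loadings and eigenvectors can be checked directly with `eigen()`. This is a minimal sketch, again assuming the built-in `USArrests` data as a stand-in for `lessData`; `demo_pca`, `eig`, and `load_mat` are illustrative names. Eigenvectors are only defined up to sign, so the comparison uses absolute values.

```r
# Stand-in data (an assumption for illustration; substitute lessData as needed)
demo_pca <- princomp(USArrests, cor = TRUE)

# Eigen-decomposition of the correlation matrix
eig <- eigen(cor(USArrests))

# The eigenvalues equal the component variances (squared sdev)
all.equal(unname(demo_pca$sdev^2), eig$values)

# The eigenvectors match the loadings up to a sign flip per column
load_mat <- unclass(loadings(demo_pca))
max(abs(abs(load_mat) - abs(eig$vectors)))
```

Note that printing `loadings(pca)` leaves small entries blank by default, whereas `unclass()` exposes the full numeric matrix.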

### Scree Plot

Next, we will plot the variance captured by each PC as we progress from the first PC to the last.
This kind of plot is typically called a _scree_ plot.

In [None]:
plot(pca,type="lines") # scree plot

In [None]:
library("factoextra")

fviz_eig(pca)

**We can see that after the first two PCs, each additional component contributes very little to the variance.**

In [None]:
reduced <- pca$scores[,1:2] # the first 2 principal components

# pca$scores holds the coordinates of the observations on the principal components.

summary(reduced)
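
The scores are the standardized observations projected onto the loading vectors. This is a minimal sketch of that computation, assuming the built-in `USArrests` data as a stand-in for `lessData`; `demo_pca`, `z`, `manual_scores`, and `reduced_demo` are illustrative names.

```r
# Stand-in data (an assumption for illustration; substitute lessData as needed)
demo_pca <- princomp(USArrests, cor = TRUE)

# Standardize with the center and scale that princomp() stored.
# princomp() uses divisor n (not n - 1) for the standard deviations,
# so reuse demo_pca$scale rather than calling sd().
z <- scale(USArrests, center = demo_pca$center, scale = demo_pca$scale)

# Project the standardized data onto the loading vectors.
manual_scores <- z %*% unclass(loadings(demo_pca))

# The result matches the scores stored by princomp().
all.equal(unname(manual_scores), unname(demo_pca$scores))

# Keeping the first two columns gives the reduced representation.
reduced_demo <- manual_scores[, 1:2]
dim(reduced_demo)
```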

### Biplot: Visualization and Interpretation

The biplot is a popular way to visualize PCA results because it combines the principal component scores and the loading vectors in a single display. In R, we simply call the biplot() function. The scale = 0 argument to biplot() ensures that the arrows are scaled to represent the loadings.

In the biplot, each observation is labeled by its observation number (the row name in the data frame), and its position gives its scores on the first two principal components. The original variables are shown as vectors (arrows) that begin at the origin and extend to the coordinates given by the first two principal component loading vectors.

In [None]:
options(repr.plot.width=12, repr.plot.height=12)
biplot(pca, scale=0) 

### Interpreting a Biplot


The bottom and left axes of a biplot show a pair of principal components, labeled Comp.1 and Comp.2.
The top and right axes represent the coefficients/loadings of the variables.

A biplot uses **points to represent the scores of the observations** on the principal components, and it uses **vectors to represent the coefficients of the variables** on the principal components.

__Interpreting Points__: The relative location of the points can be interpreted. Points that are close together correspond to observations that have similar scores on the components displayed in the plot, that is, similar positions in the transformed space. To the extent that these components fit the data well, the points also correspond to observations that have similar values on the variables.


__Interpreting Vectors__: Both the direction and length of the vectors can be interpreted. Each vector points away from the origin in some direction.

A vector shows how a variable is represented by the two principal components, that is, how much it contributes to each. For example, vectors close to horizontal contribute mostly to Comp.1, and vectors close to vertical contribute mostly to Comp.2. The angles between vectors indicate the correlation between the variables: vectors pointing in the same direction correspond to highly correlated variables, vectors pointing in opposite directions to negatively correlated variables, and vectors at roughly right angles to weakly correlated ones. This reading is approximate, since the plot shows only two components.
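
The relationship between angles and correlations can be checked numerically. This is a minimal sketch, assuming the built-in `USArrests` data as a stand-in for `lessData` and assuming the variable vectors are scaled by the component standard deviations (which differs from the raw loadings drawn with `scale = 0`). The names `demo_pca`, `var_coords`, and `cosine_2d` are illustrative, not part of the notebook.

```r
# Stand-in data (an assumption for illustration; substitute lessData as needed)
demo_pca <- princomp(USArrests, cor = TRUE)

# Scale each loading vector by its component's standard deviation.
# Each row of var_coords then holds one variable's coordinates on the components.
var_coords <- unclass(loadings(demo_pca)) %*% diag(demo_pca$sdev)

# With all components, inner products between rows reproduce the correlation matrix.
full_cor <- var_coords %*% t(var_coords)
all.equal(unname(full_cor), unname(cor(USArrests)))

# With only the first two components, the cosine of the angle between two
# variable vectors approximates their correlation.
cosine_2d <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
xy <- var_coords[, 1:2]
cosine_2d(xy["Murder", ], xy["Assault", ])
cor(USArrests$Murder, USArrests$Assault)
```

The closer the first two components come to capturing all of the variance, the closer the two printed values will be.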


