# Dimensionality Reduction 


Dimensionality reduction helps with reducing the width of the data set (the number of columns/variables). 
Reducing the data set width comes in two flavors:

  - Feature Selection - Selecting from existing features
  
  - Dimensionality Reduction - Using numerical methods to transform the feature space from known variables to computed variables


In this notebook, we will have a brief overview of feature selection methods, then we will focus on the dimensionality reduction. 

---

## Feature Selection

If you are dealing with multivariate data, the data usually contains many variables, and not all features are equally significant. A model built on a small set of informative features often predicts as well as, or better than, one built on every available feature. When the dataset is huge, computation time also matters a great deal, and building models with fewer features reduces the computational effort required. 

Feature selection acts like a filter, eliminating features that aren’t useful. It helps in building predictive models that are free from correlated variables, biases, and unwanted noise.  You might be interested in knowing which features of your data provide the most information about the target variable of interest. 

For example, suppose we’d like to predict iris species using the variables contained in R's iris dataset.

In [None]:
head(iris)

Which of the above four features provides the “purest” segmentation with respect to the target? Or put differently, if you were to place a bet on the correct species and could only ask for the value of one feature, which feature would give you the greatest likelihood of winning your bet?

### Filter Methods: 

These methods apply a statistical measure to assign a score to each feature. Features are then kept or removed from the dataset based on their scores. The methods are often univariate, scoring each feature on its own or against the dependent variable. Some of the methods that fall into this category include the chi-squared test, information gain, and correlation coefficient scores. 



**Chi-squared Test:**

The chi-square goodness-of-fit test checks whether the observed frequencies of the categories within a single categorical variable differ significantly from the expected frequencies. It tells us about the distribution of a variable; if values are equally distributed among the categories, the variable is not providing any new information.
    
Let's see how it works on iris data. 

Syntax: `chisq.test(x, p)`

- x: a numeric vector of observed counts
- p: a vector of probabilities of the same length as x

In [1]:
observed = c(50, 50, 50)        # observed frequencies
expected = c(0.333333333333333,0.333333333333333,0.333333333333333)      # expected proportions

chisq.test(x = observed, p = expected)


	Chi-squared test for given probabilities

data:  observed
X-squared = 1.4843e-28, df = 2, p-value = 1


The p-value of the test is 1, which is greater than the significance level alpha = 0.05. We can conclude that the observed proportions are not significantly different from the expected proportions.


**We can also use the chi-squared test to check whether two factors are statistically independent.** This tells us whether two categorical variables provide meaningful information separately or whether they are associated. The null hypothesis in that case is "the two categorical variables are independent" and the alternative hypothesis is "the two categorical variables are associated". 
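
As a minimal sketch of the independence test, the cell below bins `Petal.Length` into three equal-width intervals and tests whether the binned variable is independent of `Species`. The name `petal_bin` and the choice of three bins are assumptions made for illustration. Passing a two-way contingency table to `chisq.test()` runs the test of independence rather than the goodness-of-fit test.

```r
# Illustrative only: petal_bin and the 3-bin cut are assumptions, not part of the analysis above.
petal_bin <- cut(iris$Petal.Length, breaks = 3)

# Cross-tabulate the two categorical variables
contingency <- table(iris$Species, petal_bin)
contingency

# With a two-way table, chisq.test() tests independence of rows and columns
independence_test <- chisq.test(contingency)
independence_test$p.value
```

A p-value below the chosen significance level would lead us to reject the null hypothesis of independence, which suggests the binned feature carries information about the species.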


**Entropy:**

**A helpful video explaining entropy: https://www.youtube.com/watch?v=YtebGVx-Fxw**

Entropy represents the amount of "surprise", "randomness", "uncertainty" in the data; it is a measure of *new* information. In a basket of eggs, every time you pick something without looking, it will be an egg; so there is no surprise; entropy is zero. 
"Sun sets every evening" also does not contain any new information or "surprise", so the level of information it contains is zero. 

For a data set, entropy is always a calculation on a vector of categorical variable values.  It is a summation across each of the possible values the vector can take on.


$$H = -\sum_{i=1}^{n} p_i\log_2 p_i$$


**So to calculate entropy, first multiply the probability of _each value_ within the vector by the base-2 logarithm of that probability.  Then sum those products and multiply by -1.**


For example in the iris dataset, we have 3 possible values for Species (Setosa, Versicolor, Virginica), each representing $\frac{1}{3}$ of the data. (The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant.) Therefore


$$-\Bigg(\frac{1}{3} \log_2 (\frac{1}{3}) + \frac{1}{3} \log_2 (\frac{1}{3}) + \frac{1}{3} \log_2 (\frac{1}{3})\Bigg) = \log_2 3 \approx 1.585$$
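
The same calculation can be sketched in R. The helper name `entropy` is defined here for illustration and is not a base R function.

```r
# Shannon entropy (in bits) of a vector of categorical values.
# `entropy` is a helper defined here for illustration.
entropy <- function(values) {
  p <- prop.table(table(values))   # relative frequency of each category
  p <- p[p > 0]                    # drop empty categories: 0 * log2(0) is treated as 0
  -sum(p * log2(p))
}

entropy(iris$Species)   # three equally frequent classes, so this equals log2(3)
```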

----

<div style="float:left;width:600px" id="container">
    <div id="leftContainer" style="float:left;width:500px;">
        <p><b>Example:</b> What is the entropy of a group in which all examples belong to the same class?</p>
    </div>
    <div id="rightContainer" style="float:right;width:100px;">
        <img src="../images/minimum_entropy.PNG" align="center"/>
    </div>
</div>

entropy = H = 
   $$- (1\ *\ log_2(1)) = 0$$

The entropy is 0. This particular feature (class, color, etc.) makes no distinction between the observations, so it is not very useful for machine learning purposes. 


<div style="float:left;width:600px" id="container">
    <div id="leftContainer" style="float:left;width:500px;">
        <p><b> Example:</b> What is the entropy of a group with 50% in either class?</p>
    </div>
    <div id="rightContainer" style="float:right;width:100px;">
        <img src="../images/maximum_entropy.PNG" align="center"/>
    </div>
</div>

Entropy = H = 
    
$$-(0.5\ *\ log_2(0.5)\ +\ 0.5\ *\ log_2(0.5)) = 1$$
    
The entropy is 1, the maximum for two classes. This is a useful feature. 

----
<b>Information Gain: </b>

<span style="color:#4286f4; font-weight: bold">Information Gain</span> = <span style="color:#e57f2b; font-weight: bold">Parent Entropy – Weighted Average Entropy of Children</span>


Information gain helps in making two important decisions when building decision trees on data: which variable is best to split a node on, and what the best split is.

Along a similar line, we want to determine which attribute in a given set of training feature vectors is most useful for discriminating between the classes to be learned.

- Information gain tells us the importance of a given attribute of the feature vectors.
- We will use it to decide the ordering of attributes.
    
Consider following data with 30 elements of which 16 elements are green circles and remaining 14 are pink crosses. 

<img src="../images/circles_and_crosses.PNG">


Subset 1 Child entropy =  $-\Bigg(\frac{13}{17} \log_2 (\frac{13}{17}) + \frac{4}{17} \log_2 (\frac{4}{17})\Bigg) = 0.787$

Subset 2 Child entropy =  $-\Bigg(\frac{1}{13} \log_2 (\frac{1}{13}) + \frac{12}{13} \log_2 (\frac{12}{13})\Bigg) = 0.391$

Parent entropy =  $-\Bigg(\frac{16}{30} \log_2 (\frac{16}{30}) + \frac{14}{30} \log_2 (\frac{14}{30})\Bigg) = 0.997$

Weighted Average Entropy of Children = $\Bigg(\frac{17}{30} * 0.787 \Bigg) + \Bigg(\frac{13}{30} * 0.391 \Bigg) = 0.615$

    
    Information Gain for this split = 0.997 - 0.615 = 0.38 
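
The arithmetic for this split can be reproduced with a few lines of R. This is a sketch: the helper `entropy_from_counts` and the vector names are introduced here for illustration, and the class counts per subset are inferred from the fractions in the formulas above.

```r
# Entropy (in bits) from a vector of class counts; helper defined here for illustration.
entropy_from_counts <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]                  # 0 * log2(0) is treated as 0
  -sum(p * log2(p))
}

parent  <- c(circles = 16, crosses = 14)
child_1 <- c(circles = 4,  crosses = 13)   # subset 1: 17 elements
child_2 <- c(circles = 12, crosses = 1)    # subset 2: 13 elements

children <- list(child_1, child_2)
weights  <- sapply(children, sum) / sum(parent)   # 17/30 and 13/30
weighted_child_entropy <- sum(weights * sapply(children, entropy_from_counts))

information_gain <- entropy_from_counts(parent) - weighted_child_entropy
round(information_gain, 2)
```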

### Wrapper Methods: 

Wrapper methods use a subset of features and train a model using them. Based on the results drawn from the previous model, features are either added or removed from the subset. The problem is essentially reduced to a search problem. These methods are usually computationally very expensive.

Some common examples of wrapper methods are forward feature selection, backward feature elimination, recursive feature elimination, etc.

**Forward Selection:** Forward selection is an iterative method in which we start with having no feature in the model. In each iteration, we keep adding the feature which best improves our model until the addition of a new variable does not improve the performance of the model.

**Backward Elimination:** In backward elimination, we start with all the features and, at each iteration, remove the least significant feature, provided its removal improves the performance of the model. We repeat this until no improvement is observed on removal of features.

**Recursive Feature elimination:** It is a greedy optimization algorithm which aims to find the best performing feature subset. It repeatedly creates models and keeps aside the best or the worst performing feature at each iteration. It constructs the next model with the features left until all the features are exhausted. It then ranks the features based on the order of their elimination.

Read more about Recursive Feature Elimination implementation in the caret package. 

[Feature selection using Caret package](https://www.r-bloggers.com/feature-selection-with-carets-genetic-algorithm-option/)
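
As a rough sketch of forward selection, base R's `step()` can grow a model one feature at a time. The assumptions here are ours: linear regression as the model, AIC as the performance measure, the built-in `mtcars` data, and an arbitrary set of four candidate features.

```r
# Forward selection sketch on the built-in mtcars data.
# Assumptions: linear regression as the model, AIC as the performance measure.
null_model <- lm(mpg ~ 1, data = mtcars)     # start with no features

selected_model <- step(
  null_model,
  scope     = ~ wt + hp + disp + qsec,       # candidate features (chosen for illustration)
  direction = "forward",
  trace     = 0                              # suppress the step-by-step log
)

# Features that were added during the search
attr(terms(selected_model), "term.labels")
```

Setting `direction = "backward"` and starting from the full model gives the corresponding backward elimination search.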

----

## Dimensionality Reduction


Let's continue the discussion with the communities and crime dataset. 
The data is socio-economic data with a total of 1994 instances and 128 features. 
Out of the 128 variables, 122 are predictive, 5 are non-predictive and one variable is a target variable. 
The first five variables are non-predictive, so we don't have to consider them when building the model.

The dataset has missing values. 
The per-capita violent crimes variable was calculated using population and the sum of crime variables considered violent crimes in the United States-- namely murder, rape, robbery, and assault. 
There was apparently some controversy in some states concerning the counting of rapes. 
These resulted in missing values for rape, which resulted in incorrect values for per capita violent crime. 
These cities are not included in the dataset. 

Missing values should be treated before building any models. 
All numeric data is normalized into the decimal range 0.00-1.00 using an unsupervised, equal-interval binning method. 
Read the description about the dataset by opening a terminal and running this command:

```Bash
less /dsa/data/all_datasets/crime/readme.txt
```

The actual data doesn't have any column headers. 
You need to grab the headers information from the readme file. 
We have to do a little bit of data carpentry before we can start using the data to apply linear regression on it.

The header information is in the readme file. 
Keep this information in a separate file called names.txt so we can access only the part of the data we are interested in. The header data in names.txt contains a lot of unwanted information.
A sample record is shown below

    '-- state: US state (by number) - not counted as predictive above, but if considered, should be consided nominal (nominal)'

The only thing we are interested in is the first word in every line, which is the actual column name. 
So read the data separating every word using the parameter sep="". 
Header will be FALSE, because we don't have the header in the actual data file. 

In [None]:
column_names = read.csv('/dsa/data/all_datasets/crime/names.txt', header = FALSE, sep="")
head(column_names)

Attribute names are extracted but they still need some cleaning. Every attribute name has a ':' appended at the end. Get rid of the ':' in every name using the gsub() function, which replaces all matches of a pattern in a string.

In [None]:
# Attribute names are in 2nd column. Extract them. 
column_names = column_names[,2]

# gsub() replaces every match of its first argument (':') with its second argument ('', i.e. nothing) in each string of column_names.
column_names = gsub(':','', column_names)
head(column_names)

We are all set to assign these names to crime dataset.

**Note** Error expected!

In [None]:
# Run the lines of code below; the names() assignment is expected to raise an error.
crime_data <- read.csv('/dsa/data/all_datasets/crime/communities_and_crime.txt',header=FALSE)
names(crime_data)=column_names

**The error** reports a mismatch between the length of the column_names vector and the names() attribute. Check the length of the column_names vector and the number of columns in the crime_data dataset.

In [None]:
ncol(crime_data)
length(column_names)

There are 132 names in the column_names vector, which we are trying to assign to the 128 columns/variables in the crime_data dataframe. Somehow we ended up extracting 132 names instead of 128. If we look at the names vector closely, we can see what those extra names are.

In [None]:
column_names

"-", "and", "(numeric", "Part" are the four names that were created that are not actual column names. Once we eliminate these we should be good to go.

In [None]:
# In the below command, we are using the negation operator ('!') to select strings in the names vector which 
# are not in the specified list 
column_names = column_names [! column_names %in% c('-', 'and', '(numeric', 'Part')]
length(column_names)

Now that we have names for our columns in actual crime_data, let's assign them.

In [None]:
names(crime_data) = column_names
head(crime_data)

#### Splitting the data into train and test sets

(**How will you check the accuracy or how good is the fit of your model?**)

You cannot build and test the model on the same data. You have to test the accuracy of the model on unknown test data. R has libraries to split the data into train and test datasets. 

Split the dataset into training and testing datasets. We can do this using the caTools package, as shown below.

In [None]:
# set.seed() is used to make the random sample reproducible. It ensures the data is split into the same 
# partitions no matter how many times the split is run.
set.seed(144)

library(devtools)
library(caTools)

split = sample.split(crime_data$ViolentCrimesPerPop, SplitRatio = 0.7)

crime_train_data = subset(crime_data, split == TRUE)

crime_test_data = subset(crime_data, split == FALSE)

nrow(crime_train_data)
nrow(crime_test_data)

In [None]:
head(crime_test_data)

**Dimensionality reduction is not the same as feature selection.** Even though both try to reduce the number of attributes in the dataset, dimensionality reduction methods create new combinations of attributes, whereas the feature selection method includes and excludes attributes present in the data without altering them. Principal Component Analysis, Singular Value Decomposition, Factor Analysis and Sammon’s Mapping, etc. are all examples of dimensionality reduction.

Here are some of the simplest techniques for dimensionality reduction and variable exclusion:

**Missing Values Ratio:** Columns with many missing values carry less useful information. Thus, if the proportion of missing values in a column is greater than a threshold value, the column can be removed.

**Low Variance Filter:** Columns with little variance carry little information. Thus, if the variance of a column is less than a threshold value, the column can be removed. Variance is range dependent, so the data should be normalized before applying this technique.

**High Correlation Filter:** Columns that are highly correlated with each other provide almost the same information, so one of them is enough to feed the model. The Pearson correlation coefficient itself is unaffected by linear rescaling of a column, so normalization matters here mainly when this filter is combined with the variance-based filter above.

**Random Forests / Ensemble Trees:** Decision tree ensembles, or random forests, are useful for feature selection as well as data classification. Trees are constructed with attributes as nodes. If an attribute is frequently selected as the best split, it is likely to be one of the most informative features of the dataset.

**Principal Component Analysis (PCA):** Principal Component Analysis (PCA) is a statistical technique that transforms the n features of a dataset into a new set of n coordinates called principal components. The transformation is constructed so that the first principal component explains the largest possible variance. Each following component has the next highest possible variance while being uncorrelated with the preceding components.
[Additional Reading](https://www.r-bloggers.com/principal-component-analysis-using-r/)
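
The three simple filters above (missing values ratio, low variance, high correlation) can be sketched on a toy data frame. Everything in this cell is illustrative: the data values are invented, the thresholds (50% missing, variance 0.01, absolute correlation 0.9) are arbitrary choices, and the columns are assumed to be on comparable scales already.

```r
# Toy data frame with invented values (not the crime data)
toy <- data.frame(
  a = c(1, 2, 3, 4, 5, 6),
  b = c(2, 4, 6, 8, 10, 12.5),    # nearly a linear copy of a
  c = c(5, 5, 5, 5, 5, 5.01),     # nearly constant
  d = c(NA, NA, NA, NA, 1, 2),    # mostly missing
  e = c(3, 1, 4, 1, 5, 9)
)

# 1. Missing values ratio: drop columns with more than 50% missing values
missing_ratio <- colMeans(is.na(toy))
toy <- toy[, missing_ratio <= 0.5, drop = FALSE]

# 2. Low variance filter: drop columns whose variance is below 0.01
variances <- sapply(toy, var, na.rm = TRUE)
toy <- toy[, variances >= 0.01, drop = FALSE]

# 3. High correlation filter: for each pair with |r| > 0.9, drop the later column
cor_mat <- abs(cor(toy, use = "pairwise.complete.obs"))
cor_mat[upper.tri(cor_mat, diag = TRUE)] <- 0    # keep each pair once (lower triangle)
redundant <- apply(cor_mat, 1, max) > 0.9
toy <- toy[, !redundant, drop = FALSE]

names(toy)   # columns that survive all three filters
```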


In [None]:
summary(crime_train_data)

In [None]:
table(crime_train_data$LemasSwFTFieldPerPop == '?')
table(is.na(crime_train_data$LemasSwFTFieldPerPop))

Many variables have missing values encoded as `?`. 

In [None]:
head(crime_train_data)

---

## Principal Component Analysis


##### Centering and Standardizing Variables

Principal Component Analysis in R: http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/112-pca-principal-component-analysis-essentials/#biplot


Standardizing the variables is very important if we have to perform principal component analysis on the variables. 
If the variables are not standardized, then variables with large variances dominate other variables.

When the variables are standardized, they will all have variance 1 and mean 0. This would allow us to find the principal components that provide the best low-dimensional representation of the variation in the original data, without being overly biased by those variables that show the most variance in the original data.

We will use the `scale()` function in R to standardize the variables.

In [None]:
# Keep only the numeric columns; columns read in as text (factor or character) cannot be scaled.
standard_vars <- as.data.frame(scale(crime_train_data[sapply(crime_train_data, is.numeric)]))
dim(standard_vars)
head(standard_vars)

# It is also possible to standardize all the variables by specifying the "center" and "scale." arguments
# in the prcomp() function.  Do this by specifying that "center = TRUE" and "scale. = TRUE".

You can verify the means and standard deviations of the variables. The means will be nearly equal to zero and all standard deviations will equal 1.

In [None]:
sapply(standard_vars,mean)

In [None]:
sapply(standard_vars,sd)

<span style="color:#ea8d12; font-weight:700">Helpful video for understanding and plotting results from prcomp(): </span>https://www.youtube.com/watch?v=0Jp4gsfOLMs


<span style="color:#ea8d12; font-weight:700">prcomp() versus princomp(): </span>http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/118-principal-component-analysis-in-r-prcomp-vs-princomp/

In [None]:
help(prcomp)
crime_train_data_pca <- prcomp(standard_vars)

# Note: prcomp() expects the samples to be rows and the variables to be columns.  If that's 
# not the case, you must use the t() function to transpose the matrix.

# prcomp() returns several things: (1) x, (2) sdev, and (3) rotation.  

# Each principal component is a normalized linear combination of original variables.  The rotations 
# (i.e., loadings) are the coefficients of the linear combinations of the continuous variables.

# print(crime_train_data_pca) will return the principal components with the coefficients for each
# continuous variable.  In these returned values, a positive coefficient indicates that as a 
# particular PC increases, the variable(s) with positive coefficients also increase.  Variables 
# with negative coefficients decrease as the PC increases. 

In [None]:
summary(crime_train_data_pca)
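
To make the relationship between `rotation` and `x` concrete, here is a small sketch on the built-in `USArrests` data, chosen only because it is small and fully numeric. It checks that the scores in `x` are the standardized data multiplied by the loadings, and that each loading vector has unit length.

```r
# Sketch on the built-in USArrests data (not the crime data)
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# rotation: one column of coefficients (loadings) per principal component
pca$rotation

# x: the scores, i.e. the standardized data projected onto the loadings
manual_scores <- scale(USArrests) %*% pca$rotation
all.equal(unname(manual_scores), unname(pca$x))

# Each loading vector is normalized to unit length
colSums(pca$rotation^2)
```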

#### Number of Principal Components to Retain

A scree plot helps us to decide on number of principal components to be retained. The plot will summarize the PCA analysis results. The `screeplot()` function in R will help us to do this.

In [None]:
screeplot(crime_train_data_pca, type="lines")

The most obvious change in slope in the scree plot occurs at component 7; therefore, the first six components should be retained.

Another approach to deciding how many principal components to keep is Kaiser’s criterion. It suggests that we should only retain principal components for which the variance is above 1 (on standardized variables). We can check this by finding the variance of each of the principal components. The standard deviations of the principal components are stored in an element called `sdev` of the object returned by `prcomp()`, here `crime_train_data_pca`.

In [None]:
(crime_train_data_pca$sdev)^2

The components 1 through 14 have variance above 1. Using Kaiser’s criterion, we can retain the first fourteen principal components.

One more method for deciding how many principal components to retain is to keep as few components as required to explain at least some minimum amount of the total variance. For example, if we want to explain at least 70% of the variance, we will retain the first eight principal components, since the output of `summary(crime_train_data_pca)` shows that the first eight principal components explain 70% of the variance (while the first four components explain 56%).
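
Both Kaiser's criterion and the variance threshold can be computed directly from `sdev` instead of being read off the printed summary. The sketch below uses the built-in `USArrests` data rather than the crime data; the variable names and the 70% threshold are illustrative.

```r
# Sketch on the built-in USArrests data; the 70% threshold mirrors the text above
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

variances  <- pca$sdev^2                           # variance of each component
cumulative <- cumsum(variances) / sum(variances)   # cumulative proportion of variance

n_kaiser <- sum(variances > 1)                     # Kaiser's criterion
n_70pct  <- which(cumulative >= 0.70)[1]           # fewest components reaching 70%

n_kaiser
n_70pct
```

The same three lines applied to `crime_train_data_pca$sdev` would give the counts for the crime data.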

### Scatter Plots of Principal Components

The values of the principal components are stored in a named element `x` of the variable returned by `prcomp()`. `x` contains a matrix where the first column contains the first principal component, the second column the second component, and so on.

Thus, `crime_train_data_pca$x[,1]` contains the first principal component, and `crime_train_data_pca$x[,2]` contains the second principal component.

We will make a scatterplot of the first two principal components.

In [None]:
library(ggplot2)
# Put the component scores in their own data frame so the columns can be referenced by name
pca_scores <- as.data.frame(crime_train_data_pca$x)
pca_comp1_comp2 <- ggplot(pca_scores, aes(x = PC1, y = PC2))

pca_comp1_comp2+geom_point(alpha = 0.8)

In [None]:
# Count the observations so that, in the biplot below, every observation can be labelled with a dot.
# This keeps the individual points from obscuring the variable arrows (the loadings).

len <- nrow(crime_train_data)

len

biplot(crime_train_data_pca, xlabs = rep('.', len))

# rep() replicates the values in x.
# xlabs is an argument of biplot(): a vector of character strings used to label the first set of
# points (the observations).  Here it replaces the default row labels with dots.

---

## Factor Analysis


Factor analysis is a statistical method used to describe variability among observed, correlated variables in terms of a potentially lower number of unobserved variables called factors. The observed variables are modelled as linear combinations of the potential factors, plus “error” terms. The information gained about the interdependencies between observed variables can be used later to reduce the set of variables in a dataset. 

Latent variables (as opposed to observable variables) are variables that are not directly observed but are rather inferred (through a mathematical model) from other variables that are observed (directly measured).  Since factors are latent, we cannot use methods like regression, which require observed values for the predictors.

The key concept of factor analysis is that multiple observed variables have similar patterns of responses because they are all associated with a latent (i.e. not directly measured) variable. For example, people may respond similarly to questions about income, education, and occupation, which are all associated with the latent variable "socioeconomic status". It is possible that variations in n observed variables reflect the variations in just two unobserved (underlying) variables. Factor analysis searches for such joint variations in response to unobserved latent variables and aims to find independent latent variables.

Factor Analysis is a method for analyzing the covariation among the observed variables to address following questions:

* How many latent factors are needed to account for most of the variation among the observed variables?
* Which variables appear to define each factor; hence what labels should we give to these factors?

Factors are listed according to factor loadings, or how much variation in the data they can explain. 

So, we want to investigate if observable variables (e.g., X1,X2…XN) are linearly related to a small number of unobservable (latent) factors (e.g., F1,F2…FK), with K<<N.

See this link for more information: http://www.di.fc.ul.pt/~jpn/r/factoranalysis/factoranalysis.html

There are the following assumptions:

1. The error terms are independent from each other  
2. The unobservable factors are independent from each other

The second assumption is stating that these latent variables do not influence one another, which might be too strong a condition. There are more advanced models where this is relaxed.

With the loadings it is possible to compute the covariance of any two observed variables as well as the variance of each variable.
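
Written out, and under the two assumptions above plus the usual additional assumptions that the factors have unit variance and are uncorrelated with the error terms, the model and the quantities it implies are as follows. The symbols $\ell_{ij}$ (loading of variable $i$ on factor $j$), $\varepsilon_i$ (error term) and $\psi_i$ (error variance) are notation introduced here.

$$X_i = \mu_i + \sum_{j=1}^{K} \ell_{ij} F_j + \varepsilon_i, \qquad i = 1, \dots, N$$

$$\operatorname{Var}(X_i) = \sum_{j=1}^{K} \ell_{ij}^2 + \psi_i, \qquad \operatorname{Cov}(X_i, X_k) = \sum_{j=1}^{K} \ell_{ij}\,\ell_{kj} \quad (i \neq k)$$

The sum $\sum_{j} \ell_{ij}^2$ is the part of the variance of $X_i$ explained by the factors, often called the communality.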

The loadings are not unique: infinitely many sets of loadings fit the data equally well. This means that if the algorithm finds one solution that does not reveal the hypothesized structure of the problem, it is possible to apply a ‘rotation’ to find another set of loadings that might provide a better interpretation or be more consistent with prior expectations about the dataset.

There are a number of rotations in the literature. For example:

1. Varimax: a rotation that seeks to maximize the variance of the squared loadings for each factor (i.e., make them as large as possible to capture as much signal as possible)
2. Quartimax : seeks to maximize the variance of the squared loadings for each variable, and tends to produce factors with high loadings for all variables.

Rotation methods can be described as orthogonal, which do not allow the resulting factors to be correlated, and oblique, which do allow the resulting factors to be correlated.

There are two types of factor analysis: exploratory and confirmatory.

##### Exploratory factor analysis
It is done if a researcher doesn’t have any idea about the structure of data or how many dimensions are in a set of variables. It helps identify complex interrelationships among items and group items that are part of unified concepts.

##### Confirmatory Factor Analysis
It is used for verification where the researcher has a specific idea about the structure of data or how many dimensions are in a set of variables. It helps test the hypothesis that the items are associated with specific factors. Hypothesized models are tested against actual data, and the analysis would demonstrate loadings of observed variables on the latent variables (factors), as well as the correlation between the latent variables.

#### Factor Analysis vs. PCA


Factor analysis is related to principal component analysis (PCA), but the two are not identical. PCA is a more basic version of exploratory factor analysis (EFA). Factor Analysis reduces the information in a model by reducing the dimensions of the observations.  This procedure has multiple purposes.  It can be used to simplify the data, for example reducing the number of variables in predictive regression models.  If factor analysis is used for these purposes, most often factors are rotated after extraction.  Factor analysis has several different rotation methods—some of them ensure that the factors are orthogonal.  Then the correlation coefficient between two factors is zero, which eliminates problems of multicollinearity in regression analysis.

Both factor analysis and PCA assume that the modelling subspace is linear. (Kernel PCA is a more recent technique that attempts dimensionality reduction in non-linear spaces.)

But while Factor Analysis assumes a model (that may or may not fit the data), PCA is just a data transformation and for this reason it always exists. Furthermore while Factor Analysis aims at explaining covariances or correlations, PCA concentrates on variances. 

---

### Load in the dataset...

Let's perform a factor analysis on student subject preferences data. The dataset contains a hypothetical sample of 300 responses on 6 items from a survey of college students’ favorite subject matter. The items range in value from 1 to 5, which represent a scale from Strongly Dislike to Strongly Like. Our 6 items asked students to rate their liking of different college subject matter areas, including biology (BIO), geology (GEO), chemistry (CHEM), algebra (ALG), calculus (CALC), and statistics (STAT).

In [None]:
subjects_data = read.csv("/dsa/data/all_datasets/student_prefs/student_subject_preferences.csv")
head(subjects_data)

In [None]:
str(subjects_data)

Package `stats` has a function factanal() that can be used to perform factor analysis:

In [None]:
# factanal() performs maximum-likelihood factor analysis on a covariance matrix or data matrix.

# The second argument in factanal() is "factors" -- meaning the number of factors to be fitted.

# The scores argument specifies the type of scores to produce, if any. 
# The default is none; the "regression" argument gives Thompson's scores.

n.factors <- 2   # number of factors to extract

fit <- factanal(subjects_data, n.factors,  scores=c("regression"), rotation="none")
print(fit, digits=2, cutoff=.3, sort=TRUE)

In [None]:
head(fit$scores)

# For a description of Thomson's regression method, see
# https://www.rdocumentation.org/packages/stats/versions/3.5.1/topics/factanal

In [None]:
fit$loadings[,1:2] 

# Factor loading is the correlation between the observed score and the latent score. Generally, the 
# higher the better since the square of factor loading can be directly translated as item reliability.

In [None]:
# plot factor 1 by factor 2 
load <- fit$loadings[,1:2] 
plot(load, type = "n") # Set up plot. type = 'n' tells R not to plot the points. 

text(load, labels = names(subjects_data), cex = 0.7) # text() adds variable names; the cex argument controls the font size

The output maximizes variance for the 1st and subsequent factors, while all are orthogonal to each other.

Rotation serves to make the output more understandable, by seeking so-called “Simple Structure”. 

Simple structure is a pattern of loadings where items load most strongly on one factor, 
and much more weakly onto the other factors. 


In other words, varimax rotation is an orthogonal rotation of the factor axes to maximize the variance of the squared loadings of a factor (column) on all the variables (rows) in a factor matrix, which has the effect of differentiating the original variables by the extracted factor. 
Each factor will tend to have either large or small loadings of any particular variable. 


A varimax solution yields results which make it as easy as possible to identify each variable with a single factor. This is the most common rotation option. 
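
The effect of a varimax rotation can be sketched on a small loadings matrix using `stats::varimax()`. The loadings below are invented for illustration and are not taken from the student preferences data. The sketch also checks a property of any orthogonal rotation: each item's communality (the row sum of squared loadings) stays the same.

```r
# Hypothetical unrotated loadings for 6 items on 2 factors (values invented for illustration)
unrotated <- matrix(
  c(0.60,  0.55,
    0.65,  0.50,
    0.62,  0.52,
    0.58, -0.50,
    0.61, -0.55,
    0.55, -0.45),
  ncol = 2, byrow = TRUE,
  dimnames = list(paste0("item", 1:6), c("Factor1", "Factor2"))
)

rotated <- varimax(unrotated)      # returns $loadings and the rotation matrix $rotmat
unclass(rotated$loadings)          # rotated loadings as a plain matrix

# An orthogonal rotation leaves each item's communality unchanged
communality_before <- rowSums(unrotated^2)
communality_after  <- rowSums(unclass(rotated$loadings)^2)
all.equal(communality_before, communality_after)
```

With loadings shaped like these, the rotated matrix should show each item loading mainly on one of the two factors, which is the simple structure described above.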

In [None]:
fit <- factanal(subjects_data, n.factors, rotation="varimax")     # 'varimax' is an ortho rotation

load <- fit$loadings[,] 
load

In [None]:
plot(load, type = "n") # Set up plot  
 
text(load,labels = names(subjects_data), cex = .7) # add variable names

### Factor Analysis Interpretation 

Looking at both plots we see that the courses Geology, Biology, and Chemistry all have high factor loadings around 0.8 on the first factor (PA1) while Calculus, Algebra, and Statistics load highly on the second factor (PA2). 

Note that STAT has a much lower loading on PA2 than ALG or CALC and that it has a slight loading on factor PA1. 
This suggests that statistics is less related to the concept of Math than Algebra and Calculus. 
Just below the loadings table, we can see that each factor accounted for around 30% of the variance in responses, 
leading to a factor solution that accounted for 66% of the total variance in students’ subject matter preference.

The way to interpret factors is to look at the observed variables that each factor contributes to:

__Factor 1__ : contributes to Biology, Geology, and Chemistry  
__Factor 2__ : contributes to Algebra, Calculus, and Statistics  

Can we assign a conceptual label to the factors, based on the measurement variables they are contributing to?

Yes!  We can associate the first factor with **Science** and the second factor with **Math**.  If these were scores on standardized tests, we could use the factor analysis to plot students into sets of “Science Kids” and “Math Kids”.
