# Factor Analysis

Unlike PCA, factor analysis attempts to explain only the shared variance between variables.

Further, the number of factors you choose changes the estimated loadings: the values for the first factor in a one-factor model generally differ from those in a two-factor model.

Factor analysis builds a model of similar form to a regression.
That is, a single-factor factor analysis has the same interpretation as a regression model.
It has the form:  
$X_j = \mu_j + \lambda_j f + e_j$,  
where $X_j$ is the outcome variable predicted by a constant $\mu_j$ and slope $\lambda_j$ for a linear relationship with variable $f$.
Finally, there is an error term $e_j$.

A general factor analysis is similar to a multiple regression, with $m$ factors $f_i$.
Then for each observed variable $X_j$, we have a model of the form:  
$X_j = \mu_j + \lambda_{j1} f_1  + \lambda_{j2} f_2 + \ldots + \lambda_{jm} f_m + e_j$

Now, unlike a true regression analysis, we don't have direct observation of the factors $f_i$.
We make the assumption that such a model form describes each outcome variable, and then try to estimate the $\lambda_{ji}$.
We'll refer to these "regression slopes" as "factor loadings."
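
To make the model form concrete, here is a minimal simulation sketch in Python with NumPy. Every number in it (sizes, intercepts, loadings, error variances) is made up for illustration, and drawing the factors and errors from a normal distribution is an extra assumption that the model itself does not require. In a real analysis only `X` would be available; `f` and `e` stay hidden.

```python
import numpy as np

rng = np.random.default_rng(0)

n_obs, n_vars, m = 500, 4, 2               # hypothetical sizes
mu = np.array([10.0, 5.0, 0.0, -2.0])      # hypothetical intercepts mu_j
# Hypothetical loadings lambda_{ji}: one row per variable, one column per factor.
Lambda = np.array([[0.9, 0.0],
                   [0.8, 0.1],
                   [0.1, 0.7],
                   [0.0, 0.6]])
psi = np.array([0.19, 0.35, 0.50, 0.64])   # hypothetical error variances

f = rng.standard_normal((n_obs, m))                       # latent factors (unobserved in practice)
e = rng.standard_normal((n_obs, n_vars)) * np.sqrt(psi)   # error terms e_j
X = mu + f @ Lambda.T + e    # X_j = mu_j + lambda_{j1} f_1 + ... + lambda_{jm} f_m + e_j
```

Each row of `X` is one observation of the four outcome variables, and each column follows its own equation of the form above.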

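So a factor analysis only makes sense when we have several outcome variables $X_j$: since the factors are unobserved, the loadings have to be estimated from the covariances among the outcomes, and a single variable has none.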

### Some simplifying assumptions

We can make some simplifying assumptions for our latent factors $f_i$.
Because they're "latent" rather than measured on a real scale, we can safely make these assumptions.
Really, we're putting constraints on our solution, I think.

1. Assume all $f_i$ have mean 0.
2. Assume all $f_i$ have standard deviation of 1.

From these two, we basically are assuming $f_i$ are standardized, or z-scored.

3. Missed
4. Missed
5. $\forall j, k: Cov(e_j, f_k) = 0$.
    That is, the factors are uncorrelated with the random errors.
6. $\forall k \neq k': Cov(f_k, f_{k'}) = 0$.
    That is, the factors are pairwise uncorrelated.
    Advanced techniques relax this assumption.
7. $\forall j \neq j': Cov(e_j, e_{j'}) = 0$.
    That is, the random errors of different variables are pairwise uncorrelated.

Based on the assumptions, we can use some algebra on variances and covariances to establish a relationship between population covariance matrix $\Sigma_{xx}$ and the parameters on factors $f_i$ (that is, the $\lambda_{ji}$ values).
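
In matrix form this relationship is usually written $\Sigma_{xx} = \Lambda \Lambda^T + \Psi$, where $\Lambda$ is the matrix of loadings (one row per variable, one column per factor) and $\Psi$ is a diagonal matrix holding the error variances $Var(e_j)$. That is the standard textbook way of writing it rather than something stated in these notes. A small NumPy sketch, with a made-up single-factor example and a function name (`implied_covariance`) invented here:

```python
import numpy as np

def implied_covariance(Lambda, psi):
    """Model-implied covariance matrix: Lambda @ Lambda.T + diag(psi)."""
    Lambda = np.asarray(Lambda, dtype=float)
    psi = np.asarray(psi, dtype=float)
    return Lambda @ Lambda.T + np.diag(psi)

# Hypothetical single-factor example with three variables.
Lambda = np.array([[0.9], [0.8], [0.7]])   # loadings
psi = np.array([0.19, 0.36, 0.51])         # error variances
Sigma = implied_covariance(Lambda, psi)
print(np.round(Sigma, 2))
```

With these numbers the diagonal comes out as 1 for every variable and the off-diagonal entries are products of pairs of loadings, for example 0.9 × 0.8 = 0.72.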

## Computing stuff


### Single-factor case

Recall that the covariance of a constant with a variable is always 0, since a constant has $\sigma$ of 0.
Also, the variance of a variable plus a constant is just the variance of the variable.


#### Proving $Var(X_j) = \lambda_j^2 + \psi_j$

Now, we have $Var(X_j) = Var(\mu_j + \lambda_j f + e_j)$, since $X_j$ is assumed to be modeled by $\mu_j + \lambda_j f + e_j$.  
$\mu_j$ is constant, so $Var(\mu_j + \lambda_j f + e_j) = Var(\lambda_j f + e_j)$  
Further, since $Var(x + y) = Var(x) + Var(y) + 2Cov(x, y)$, we have $Var(\lambda_j f + e_j) = Var(\lambda_j f) + Var(e_j) + 2 Cov(\lambda_j f, e_j)$.  
But $Cov(\lambda_j f, e_j) = 0$ by assumption, so we have $Var(\lambda_j f) + Var(e_j) + 2 Cov(\lambda_j f, e_j) = Var(\lambda_j f) + Var(e_j)$.  
Since $Var(cx) = c^2 Var(x)$ for a constant $c$, the first term becomes $\lambda_j^2 Var(f)$, giving $\lambda_j^2 Var(f) + Var(e_j)$.
Finally, we have $Var(f) = 1$ by assumption, and we notate the variance of error by $\psi_j = Var(e_j)$.  
$\therefore Var(X_j) = \lambda_j^2 + \psi_j$.

Basically, the proof above shows that under our assumptions/constraints for the factors $f$, the variance of our observed outcome variable is the sum of squares of slopes (single slope for single-factor case), plus some error variance unexplained by the model.

#### Proving $Cov(X_i, X_j) = \lambda_i \lambda_j$

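By the model, for $i \neq j$, $Cov(X_i, X_j) = Cov(\mu_i + \lambda_i f + e_i, \mu_j + \lambda_j f + e_j)$.  
The constants $\mu_i$ and $\mu_j$ contribute nothing, so this equals $Cov(\lambda_i f + e_i, \lambda_j f + e_j)$.  
Expanding term by term gives $\lambda_i \lambda_j Var(f) + \lambda_i Cov(f, e_j) + \lambda_j Cov(e_i, f) + Cov(e_i, e_j)$.  
The last three terms are 0 by assumptions 5 and 7, and $Var(f) = 1$.  
$\therefore Cov(X_i, X_j) = \lambda_i \lambda_j$.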
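
Both identities in this section can be sanity-checked numerically. The sketch below draws a large sample from a single-factor model and compares the sample covariance matrix with the predicted values. The loadings, error variances and constants are made up, and the normal draws are an extra assumption (the algebra only needs the means, variances and zero covariances).

```python
import numpy as np

rng = np.random.default_rng(42)
n_obs = 200_000

lam = np.array([0.9, 0.5, -0.7])   # hypothetical loadings lambda_j
psi = np.array([0.3, 1.0, 0.5])    # hypothetical error variances psi_j
mu = np.array([1.0, -2.0, 3.0])    # constants; they should not affect any (co)variance

f = rng.standard_normal(n_obs)                        # single factor with Var(f) = 1
e = rng.standard_normal((n_obs, 3)) * np.sqrt(psi)    # mutually independent errors
X = mu + np.outer(f, lam) + e                         # X_j = mu_j + lambda_j f + e_j

S = np.cov(X, rowvar=False)        # sample covariance matrix (columns are variables)
# Predicted: lambda_j^2 + psi_j on the diagonal, lambda_i * lambda_j off it.
predicted = np.outer(lam, lam) + np.diag(psi)

print(np.round(S, 2))
print(np.round(predicted, 2))
```

The two printed matrices should agree up to sampling noise, which shrinks as `n_obs` grows.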

Example.  
Consider the case where we have 3 variables.
We get a 3x3 covariance matrix $S$.

Now, our analysis will have three models, for $X_1$, $X_2$, and $X_3$.
This gives us six unknowns, namely $\lambda_1, \lambda_2, \lambda_3, \psi_1, \psi_2, \psi_3$.

Now, our covariance matrix has the form:
| Var($X_1$) | Cov($X_1, X_2$) | Cov($X_1, X_3$) |
|---|---|---|
| Cov($X_2, X_1$) | Var($X_2$) | Cov($X_2, X_3$) |
| Cov($X_3, X_1$) | Cov($X_3, X_2$) | Var($X_3$) |

By substituting the equations proved above, we get:
| $\lambda_1^2 + \psi_1$ | $\lambda_1 \lambda_2$ | $\lambda_1 \lambda_3$ |
|---|---|---|
| $\lambda_2 \lambda_1$ | $\lambda_2^2 + \psi_2$ | $\lambda_2 \lambda_3$ |
| $\lambda_3 \lambda_1$ | $\lambda_3 \lambda_2$ | $\lambda_3^2 + \psi_3$ |

From these, we can extract 6 equations in 6 unknowns.  
Actually, that was probably a tad overcomplicated.

Even without the matrix, we could identify six equations:
$Var(X_1) = \lambda_1^2 + \psi_1$,  
$Var(X_2) = \lambda_2^2 + \psi_2$,  
$Var(X_3) = \lambda_3^2 + \psi_3$,  
$Cov(X_1, X_2) = \lambda_1 \lambda_2$,  
$Cov(X_1, X_3) = \lambda_1 \lambda_3$,  
$Cov(X_2, X_3) = \lambda_2 \lambda_3$.  

This is six equations in six unknowns, since we can compute the left-hand sides.
These can be solved, though it's not quite a linear system, and I don't see an obvious way to linearize it.
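
For what it's worth, this particular system does have a closed-form solution: multiplying the first two covariance equations and dividing by the third gives $\lambda_1^2 = Cov(X_1, X_2) \, Cov(X_1, X_3) / Cov(X_2, X_3)$, and everything else follows by substitution. Below is a sketch in plain Python with a made-up covariance matrix. The function name `solve_single_factor_3` is invented for this example, and it assumes the covariances are nonzero with a positive product (otherwise there is no real solution). The overall sign of the loadings is arbitrary, since flipping all three leaves every covariance unchanged, so the sketch fixes $\lambda_1 > 0$.

```python
import math

def solve_single_factor_3(S):
    """Solve the six equations for a 3x3 covariance matrix S (nested lists)."""
    s12, s13, s23 = S[0][1], S[0][2], S[1][2]
    if s23 == 0:
        raise ValueError("Cov(X_2, X_3) is zero; the closed form does not apply")
    ratio = s12 * s13 / s23              # equals lambda_1^2 under the model
    if ratio <= 0:
        raise ValueError("no real, nonzero solution for lambda_1")
    lam1 = math.sqrt(ratio)              # sign convention: lambda_1 > 0
    lam = [lam1, s12 / lam1, s13 / lam1]
    psi = [S[j][j] - lam[j] ** 2 for j in range(3)]   # psi_j = Var(X_j) - lambda_j^2
    return lam, psi

# Hypothetical covariance matrix (here also a correlation matrix).
S = [[1.00, 0.72, 0.63],
     [0.72, 1.00, 0.56],
     [0.63, 0.56, 1.00]]
lam, psi = solve_single_factor_3(S)
print([round(v, 3) for v in lam])   # [0.9, 0.8, 0.7]
print([round(v, 3) for v in psi])   # [0.19, 0.36, 0.51]
```

If any $\psi_j$ came out negative, the equations would technically be solved, but a negative variance would mean the single-factor model is not a sensible description of that matrix.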

Now, suppose we only had two observed variables.
Then our covariance matrix would be 2x2, with just 3 unique elements.  
However, because we have two variables, we have four unknowns (two loadings and two error variances).
Then the problem is underidentified, and there are infinitely many solutions.  

On the other hand, with a 4x4 matrix, we have 10 unique values and just 8 unknowns.
Then we have an overidentified problem, which is actually preferable.
We can do some form of best-fit solution.
Then we can get a meaningful measure of statistical fit of the model.

For a perfectly identified problem (the 3x3 case) the fit is always either 1 or undefined (if the problem has no solution).

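Note that in general, for an $n \times n$ covariance matrix, we have $\frac{n(n+1)}{2}$ unique values and, in the single-factor case, $2n$ unknowns.
With $m$ factors there are $nm$ loadings plus $n$ error variances, so $n(m+1)$ unknowns. For $m > 1$ the loadings are only determined up to a rotation of the factors, so $\frac{m(m-1)}{2}$ of those unknowns are not free.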
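
The single-factor counting argument can be tabulated for a few values of $n$. This is a small plain-Python sketch; the function names are invented for the example.

```python
def unique_covariances(n):
    """Distinct entries in a symmetric n x n covariance matrix."""
    return n * (n + 1) // 2

def single_factor_unknowns(n):
    """n loadings plus n error variances."""
    return 2 * n

def identification(n):
    """Classify the single-factor problem with n observed variables."""
    eqs, unk = unique_covariances(n), single_factor_unknowns(n)
    if eqs < unk:
        return "underidentified"
    if eqs == unk:
        return "just identified"
    return "overidentified"

for n in (2, 3, 4, 5):
    print(n, unique_covariances(n), single_factor_unknowns(n), identification(n))
```

This reproduces the cases discussed above: $n = 2$ is underidentified, $n = 3$ has exactly as many equations as unknowns, and $n \geq 4$ is overidentified.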

Finally, note that for a best-fit solution, ordinary least squares (as in regression) is generally not desirable. I guess the fact that we're fitting covariances rather than regular-old random variables gives the estimates undesirable statistical properties.
Other best-fit approaches work better; maximum likelihood is a common choice.

### Evaluating model

I missed the justification, but given our sample correlation matrix $R$ and the estimated population correlation matrix $\hat\Sigma_{zz}$, we can take $R - \hat\Sigma_{zz}$ to get the residual correlations.

Then any positive residuals indicate the model is underpredicting the correlation, and negative residuals indicate the model is overpredicting the correlation.
That is, a positive residual indicates that, based on the sample, the correlation should be higher than what the model is predicting, so there is some correlation not explained by the model.
On the other hand, negative means there's actually less correlation than the model suggests.  
The absolute values indicate the degree to which we are over/under-predicting.
Residuals near zero indicate the model is a very good fit for those parts of the data.

Individual residuals are interpreted in terms of the question of whether or not there is a common factor underlying a pair of variables, I think.
If we see a lot of high-value residuals for a specific subset of variables against another subset, we might infer that another factor exists, and do an analysis with $m+1$ factors.
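
Computing the residual matrix is a one-liner once both matrices are in hand. In this NumPy sketch the sample correlations and the loadings are made-up numbers, and the loadings are simply assumed rather than estimated from `R`. For a single-factor model on standardized variables, the model-implied correlations are $\lambda_i \lambda_j$ off the diagonal and 1 on it.

```python
import numpy as np

# Hypothetical sample correlation matrix R for three variables.
R = np.array([[1.00, 0.70, 0.60],
              [0.70, 1.00, 0.58],
              [0.60, 0.58, 1.00]])

# Hypothetical standardized single-factor loadings (assumed, not fitted to R).
lam = np.array([0.9, 0.8, 0.7])
Sigma_hat = np.outer(lam, lam)       # lambda_i * lambda_j for every pair
np.fill_diagonal(Sigma_hat, 1.0)     # lambda_j^2 + psi_j = 1 for standardized variables

residuals = R - Sigma_hat
print(np.round(residuals, 2))
```

Here the residual for the pair $(X_2, X_3)$ is positive (0.58 observed against 0.56 predicted), so the model slightly underpredicts that correlation, while the other two pairs are slightly overpredicted.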

## Correlation vs Covariance

As with PCA, we can analyze either the covariance matrix or the correlation matrix.

For covariance, $S$ is the sample covariance matrix, $\Sigma_{xx}$ is the population covariance matrix, and $\hat\Sigma_{xx}$ is our estimate of the population covariance matrix.

For correlation, the corresponding notations are $R$, $\Sigma_{zz}$, and $\hat\Sigma_{zz}$.

## Factor Analysis vs. PCA

1. PCA explains total variance.
    Factor analysis distinguishes between common and unique variance.  
    The **communality** of variable $X_j$ is $\lambda_{j1}^2 + \lambda_{j2}^2 + \ldots + \lambda_{jm}^2$.
    By the variance decomposition proved earlier, this is the part of $Var(X_j)$ that the common factors account for.  
    The **uniqueness** of $X_j$ is $\psi_j$, the part of the variance left to the error term.
    Also called the **specific variance**.
2. PCA components (equivalent of factors) are functions of the indicator (observed, manifest) variables; in factor analysis, the indicator variables are functions of the factors.  
    So, components are a composite of observed variables, which may represent a generalized concept with observed variables as facets of the concept.  
    On the other hand, factors are implicit explainers for observed values.
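
Communality and uniqueness are easy to compute from a loading matrix. The NumPy sketch below uses made-up loadings for four standardized variables on two factors, and relies on the multi-factor version of the variance result, $Var(X_j) = \lambda_{j1}^2 + \ldots + \lambda_{jm}^2 + \psi_j$, which holds when the factors are uncorrelated with unit variance.

```python
import numpy as np

# Hypothetical loadings: 4 variables (rows) on 2 factors (columns).
Lambda = np.array([[0.9, 0.0],
                   [0.8, 0.1],
                   [0.1, 0.7],
                   [0.0, 0.6]])
# Hypothetical total variances of the observed variables (1.0 means standardized).
total_var = np.ones(4)

communality = (Lambda ** 2).sum(axis=1)   # sum of squared loadings for each variable
uniqueness = total_var - communality      # psi_j = Var(X_j) - communality

print(np.round(communality, 2))
print(np.round(uniqueness, 2))
```

For standardized variables the two add up to 1, so a communality of 0.81 for the first variable reads as "81% of its variance is shared with the common factors."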