# Statistical Analysis II - Practicum 3 (Week 11)

## Factor analysis

In [None]:
from factor_analyzer import FactorAnalyzer

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from sklearn.decomposition import FactorAnalysis, PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris

### Factor analysis

Resources from [url1](https://www.datasklr.com/principal-component-analysis-and-factor-analysis/factor-analysis), [url2](https://www.analyticsvidhya.com/blog/2020/10/dimensionality-reduction-using-factor-analysis-in-python/), [url3](https://scikit-learn.org/stable/auto_examples/decomposition/plot_varimax_fa.html#sphx-glr-auto-examples-decomposition-plot-varimax-fa-py), and [url4](https://www.datacamp.com/tutorial/introduction-factor-analysis).

The documentation on the packages used is available from [link1](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.FactorAnalysis.html) and [link2](https://factor-analyzer.readthedocs.io/en/latest/).

**Factor analysis**, together with PCA and other techniques, belongs to the family of **multivariate analysis** methods.

- Factor analysis (FA) is also used for dimensionality reduction, but its aim is to describe the variability among *observed* and (potentially) *correlated* variables.

- The variables that account for these correlations are called *latent variables* (latent, because they are not directly measured), or *factors*; each observed input variable is modelled as a linear combination of the factors plus an error term.

- Factors represent the common variance, i.e. the part of the variance that is due to correlation among the input variables.

- The weight that links an input variable to a factor is called a *factor loading*.

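- For each observed variable $x_i$ with mean $\mu_i$: $x_i - \mu_i = \sum_k l_{i,k} F_k + \epsilon_i$, where $l_{i,k}$ is the loading of variable $i$ on factor $F_k$ and $\epsilon_i$ is the error term specific to variable $i$.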

- *Confirmatory FA* is used when one already has in mind which combinations of input variables may produce meaningful latent variables. In *Exploratory FA*, by contrast, this is unknown.
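
The factor model in the list above can be made concrete with a small simulation. This is only a sketch under stated assumptions: the loadings, means and noise levels are made up for illustration, and the factors are assumed to be uncorrelated with unit variance. Under those assumptions the covariance implied by the model is $L L^T + \Psi$, with $\Psi$ the diagonal matrix of error variances, and the sample covariance should come close to it.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_vars, n_factors = 20000, 4, 2

# Hypothetical loadings l_{i,k}: rows are observed variables, columns are factors.
loadings = np.array([
    [0.9, 0.0],
    [0.8, 0.1],
    [0.1, 0.7],
    [0.0, 0.6],
])
mu = np.array([5.0, 6.0, 4.0, 7.0])        # hypothetical means mu_i
noise_sd = np.array([0.4, 0.5, 0.6, 0.7])  # hypothetical std of each eps_i

# Latent factors F_k (assumed uncorrelated, unit variance) and variable-specific errors.
F = rng.standard_normal((n_samples, n_factors))
eps = rng.standard_normal((n_samples, n_vars)) * noise_sd

# x_i = mu_i + sum_k l_{i,k} F_k + eps_i, for all samples at once.
X = mu + F @ loadings.T + eps

# Covariance implied by the model versus covariance estimated from the data.
model_cov = loadings @ loadings.T + np.diag(noise_sd**2)
sample_cov = np.cov(X, rowvar=False)
max_abs_diff = np.abs(sample_cov - model_cov).max()

print(np.round(model_cov, 2))
print(np.round(sample_cov, 2))
```

The off-diagonal entries of `model_cov` come only from the shared factors, which is the sense in which factors capture the common variance.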

Let us look at a practical example: a recruiter wants to hire employees for a business firm.
The interview process is over, and each interviewee has been rated out of 10 on each personality trait.

In [None]:
# create the data frame
dataframe = pd.read_csv('./Datasets/dataset1.txt', sep=" ", header=0, index_col=0)
dataframe.head()

In [None]:
cols = dataframe.columns

We want to find out whether latent variables can be identified that reduce the dimensionality of this problem, which consists of 32 variables.

First, how strongly are the input variables correlated?

In [None]:
ax = plt.axes()

im = ax.imshow(np.corrcoef(dataframe.T), cmap="RdBu_r", vmin=-1, vmax=1)

plt.colorbar(im).ax.set_ylabel("$r$", rotation=0)

# one tick per variable, labelled with the column names
ax.set_xticks(np.arange(len(cols)))
ax.set_xticklabels(list(cols), rotation=90)
ax.set_yticks(np.arange(len(cols)))
ax.set_yticklabels(list(cols))

ax.set_title("Interviewees variable correlation matrix")
plt.tight_layout()

How many factors can be used as latent variables?

In [None]:
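# Fit with as many factors as variables, only to obtain the eigenvalues
fa = FactorAnalyzer(rotation=None, impute="drop", n_factors=dataframe.shape[1])

fa.fit(dataframe)

# Eigenvalues of the correlation matrix (the second return value is not needed here)
ev, _ = fa.get_eigenvalues()

factor_numbers = range(1, dataframe.shape[1] + 1)
plt.scatter(factor_numbers, ev)
plt.plot(factor_numbers, ev)
plt.title('Scree Plot')
plt.xlabel('Factors')
plt.ylabel('Eigenvalue')  # an eigenvalue measures how much variance a factor accounts for, in units of one standardized variable
plt.grid()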
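
One common rule of thumb for reading such a plot is the Kaiser criterion: keep the factors whose eigenvalue exceeds 1, i.e. those that account for more variance than a single standardized variable. The sketch below applies it with plain NumPy to synthetic data (six variables generated from two hidden factors; the loadings and noise level are assumptions made for illustration, not the interview dataset). Looking for the "elbow" of the curve is an alternative, and the two rules do not always agree.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 500

# Synthetic stand-in for the ratings: 6 variables driven by 2 hidden factors.
F = rng.standard_normal((n_samples, 2))
loadings = np.array([
    [0.8, 0.0], [0.8, 0.0], [0.8, 0.0],
    [0.0, 0.8], [0.0, 0.8], [0.0, 0.8],
])
X = F @ loadings.T + 0.6 * rng.standard_normal((n_samples, 6))

# Eigenvalues of the correlation matrix, sorted from largest to smallest.
corr = np.corrcoef(X, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]

# Kaiser criterion: retain the factors with eigenvalue > 1.
n_retained = int((eigenvalues > 1).sum())

print(np.round(eigenvalues, 2))
print("factors retained by the Kaiser criterion:", n_retained)
```

The eigenvalues of a correlation matrix always sum to the number of variables, which is why 1 is the natural threshold.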

How do the factors relate to the input variables?

In [None]:
fa = FactorAnalyzer(n_factors=6, rotation='varimax')  # varimax: a few large loadings and many loadings close to 0

fa.fit(dataframe)

print(pd.DataFrame(fa.loadings_,index=dataframe.columns))
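
To see what a varimax rotation does to a loading matrix, here is a minimal NumPy sketch of the usual iterative SVD scheme. It is an illustration only and is not claimed to match the internals of `FactorAnalyzer`. The helper names `varimax` and `varimax_criterion` and the example loadings are hypothetical. The rotation is orthogonal, so the fit of the model is unchanged; only the way the loadings are spread over the factors changes.

```python
import numpy as np

def varimax(loadings, tol=1e-6, max_iter=100):
    """Rotate a (n_variables, n_factors) loading matrix towards the varimax criterion."""
    n_rows, n_cols = loadings.shape
    rotation = np.eye(n_cols)
    previous = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        # Gradient-like term of the varimax criterion for the current rotation.
        target = rotated**3 - rotated * (rotated**2).sum(axis=0) / n_rows
        u, s, vt = np.linalg.svd(loadings.T @ target)
        rotation = u @ vt  # closest orthogonal matrix to the update direction
        current = s.sum()
        if previous != 0.0 and current < previous * (1 + tol):
            break
        previous = current
    return loadings @ rotation, rotation

def varimax_criterion(loadings):
    """Sum over factors of the variance of the squared loadings."""
    return float(np.var(loadings**2, axis=0).sum())

# Hypothetical "simple structure": each variable loads on one factor only.
simple = np.array([[0.8, 0.0], [0.7, 0.0], [0.9, 0.0],
                   [0.0, 0.8], [0.0, 0.7], [0.0, 0.6]])

# Blur it with a 30 degree rotation, as an unrotated solution might look.
theta = np.deg2rad(30)
mix = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta), np.cos(theta)]])
unrotated = simple @ mix

rotated, R = varimax(unrotated)
print(np.round(unrotated, 2))
print(np.round(rotated, 2))
```

The rotated loadings may come back with columns swapped or signs flipped relative to `simple`; this does not affect interpretation.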

In [None]:
Z=np.abs(fa.loadings_)

fig, ax = plt.subplots()
c = ax.pcolor(Z)
fig.colorbar(c, ax=ax)
ax.set_yticks([r+0.5 for r in range(len(cols))])
ax.set_yticklabels(list(cols))
ax.set_xticks(np.arange(fa.loadings_.shape[1])+0.5, minor=False)
ax.set_xticklabels(np.arange(fa.loadings_.shape[1]), minor=False)
plt.show()

What is the amount of variance explained by the factors?

In [None]:
print(pd.DataFrame(fa.get_factor_variance(),index=['Variance','Proportional Var','Cumulative Var']))
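
The three rows of that table can be derived from the loadings alone. The sketch below uses a made-up loading matrix (an assumption for illustration) and standardized variables, so that each variable contributes a variance of 1. To the best of my understanding this mirrors what `get_factor_variance()` reports, but treat that as an assumption and check it against your own fit.

```python
import numpy as np

# Hypothetical loadings for 5 standardized variables on 2 factors.
loadings = np.array([
    [0.8, 0.1],
    [0.7, 0.2],
    [0.6, 0.3],
    [0.1, 0.9],
    [0.2, 0.7],
])

# Variance per factor: sum of squared loadings down each column.
ss_loadings = (loadings**2).sum(axis=0)

# Proportion of the total variance (= number of variables) per factor.
proportion = ss_loadings / loadings.shape[0]

# Running total over the factors.
cumulative = np.cumsum(proportion)

print("Variance        ", np.round(ss_loadings, 3))
print("Proportional Var", np.round(proportion, 3))
print("Cumulative Var  ", np.round(cumulative, 3))
```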

What are the communalities, i.e. the proportion of each input variable's variance that is explained by the factors?

In [None]:
print(pd.DataFrame(fa.get_communalities(),index=dataframe.columns,columns=['Communalities']))
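
A communality is the sum of the squared loadings of one variable across all factors. Its complement, the uniqueness, is the share of variance left to the variable-specific error. The sketch below works this out for a made-up loading matrix (hypothetical values, standardized variables assumed) and flags variables that the factors explain poorly. The 0.5 cut-off is an arbitrary illustration, not a rule taken from this practicum.

```python
import numpy as np

# Hypothetical loadings for 5 standardized variables on 2 factors.
loadings = np.array([
    [0.8, 0.1],
    [0.7, 0.2],
    [0.6, 0.3],
    [0.1, 0.9],
    [0.2, 0.7],
])

# Communality: sum of squared loadings along each row (one value per variable).
communalities = (loadings**2).sum(axis=1)

# Uniqueness: the part of each variable's variance not explained by the factors.
uniquenesses = 1.0 - communalities

# Variables whose communality falls below an illustrative threshold.
poorly_explained = np.where(communalities < 0.5)[0]

print(np.round(communalities, 2))
print(np.round(uniquenesses, 2))
print("indices of poorly explained variables:", poorly_explained)
```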

What are the main differences between PCA and FA? Let's find out with another example, the _iris dataset_.

In [None]:
data = load_iris()
X = StandardScaler().fit_transform(data["data"])
variables_names = data["feature_names"]
print(data)

What does the correlation matrix look like?

In [None]:
ax = plt.axes()

im = ax.imshow(np.corrcoef(X.T), cmap="RdBu_r", vmin=-1, vmax=1)

ax.set_xticks([0, 1, 2, 3])
ax.set_xticklabels(list(variables_names), rotation=90)
ax.set_yticks([0, 1, 2, 3])
ax.set_yticklabels(list(variables_names))

plt.colorbar(im).ax.set_ylabel("$r$", rotation=0)
ax.set_title("Iris feature correlation matrix")
plt.tight_layout()
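
Because the features were standardized first, the covariance matrix of `X` and the correlation matrix of the raw data are the same object. A quick NumPy check on synthetic data (a stand-in with four columns on different scales, not the iris measurements) illustrates this. The standardization below is written by hand and assumes the population standard deviation, which is what `StandardScaler` uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: 150 samples, 4 correlated columns on very different scales.
mixing = rng.standard_normal((4, 4))
scales = np.array([1.0, 5.0, 0.5, 10.0])
raw = (rng.standard_normal((150, 4)) @ mixing) * scales

# Hand-written standardization: zero mean, unit (population) variance per column.
X_std = (raw - raw.mean(axis=0)) / raw.std(axis=0)

# Covariance of the standardized data (population normalisation) ...
cov_of_standardized = np.cov(X_std, rowvar=False, bias=True)
# ... equals the correlation matrix of the raw data.
corr_of_raw = np.corrcoef(raw, rowvar=False)

print(np.allclose(cov_of_standardized, corr_of_raw))
```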

Let's compare PCA and FA.

In [None]:
n_comps = 2

methods = [
    ("PCA", PCA()),
    ("Unrotated FA", FactorAnalysis()),
    ("Varimax FA", FactorAnalysis(rotation="varimax")),
]
fig, axes = plt.subplots(ncols=len(methods), figsize=(10, 8))

for ax, (method, fa) in zip(axes, methods):
    fa.set_params(n_components=n_comps)
    fa.fit(X)

    components = fa.components_.T
    print("\n\n %s :\n" % method)
    print(components)

    vmax = np.abs(components).max()
    ax.imshow(components, cmap="RdBu_r", vmax=vmax, vmin=-vmax)
    ax.set_yticks(np.arange(len(variables_names)))
    if ax.get_subplotspec().is_first_col():  # Axes.is_first_col() is no longer available in recent Matplotlib
        ax.set_yticklabels(variables_names)
    else:
        ax.set_yticklabels([])
    ax.set_title(str(method))
    ax.set_xticks([0, 1])
    ax.set_xticklabels(["Comp. 1", "Comp. 2"])
fig.suptitle("Factors")
plt.tight_layout()
plt.show()

- PCA components explain the maximum amount of variance, while factor analysis explains the covariance in the data.

- PCA components are fully orthogonal to each other whereas factor analysis does not require factors to be orthogonal.

- A PCA component is a linear combination of the observed variables, whereas in FA the observed variables are linear combinations of the unobserved variables (factors).

- PCA components are often hard to interpret. In FA, the underlying factors are meant to be labelable and interpretable.

- PCA is a dimensionality reduction method, whereas factor analysis is a latent variable method.

- PCA is sometimes regarded as a type of factor analysis. PCA is observational (it only re-expresses the data), whereas FA is a modeling technique.
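
The PCA side of the list above can be checked numerically. The sketch below performs PCA by hand with NumPy, rather than with `sklearn.decomposition.PCA`, so that each step is visible. It runs on synthetic data whose generating loadings are made up for illustration. It shows that the component directions are orthonormal, that the scores are linear combinations of the observed variables, and that the scores are uncorrelated with variances equal to the sorted eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 4 observed variables generated from 2 hidden factors plus noise.
F = rng.standard_normal((1000, 2))
generating_loadings = np.array([[0.9, 0.0], [0.8, 0.2], [0.1, 0.8], [0.0, 0.7]])
X = F @ generating_loadings.T + 0.5 * rng.standard_normal((1000, 4))
X = X - X.mean(axis=0)

# PCA by hand: eigendecomposition of the covariance matrix, largest eigenvalue first.
cov = np.cov(X, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Each component score is a linear combination of the observed variables.
scores = X @ eigenvectors

# The directions are orthonormal and the scores are mutually uncorrelated.
gram = eigenvectors.T @ eigenvectors
score_cov = np.cov(scores, rowvar=False)

print(np.round(gram, 3))
print(np.round(score_cov, 3))
print("share of variance per component:", np.round(eigenvalues / eigenvalues.sum(), 3))
```

In FA the direction of the relationship is reversed: the observed variables are written in terms of the factors, and each variable keeps its own error variance.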

# If you have any questions: s.lopiano@reading.ac.uk