# Factor Analysis

A data exploration method that explains the patterns observed in a dataset in terms of a small number of underlying factors.

Factors (also called latent variables) are inferred rather than directly observed.

## This model assumes that:
1. features are metric
2. features are continuous or ordinal
3. the correlation coefficient between the features in the dataset is r > 0.3
4. there are more than 100 observations and more than 5 observations per feature
5. the sample is homogeneous
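
The sample-size and correlation rules of thumb above can be checked before fitting. This is a minimal sketch on the iris data, assuming that "r > 0.3" refers to the absolute Pearson correlation and that each feature only needs one partner above the threshold. Both readings are assumptions, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

n_obs, n_features = df.shape

# Sample-size rules of thumb from the list above
enough_rows = n_obs > 100
enough_rows_per_feature = n_obs / n_features > 5

# Absolute pairwise Pearson correlations, with the diagonal masked out
corr = df.corr().abs()
off_diag = corr.where(~np.eye(n_features, dtype=bool))

# For each feature, its strongest correlation with any other feature
max_corr_per_feature = off_diag.max(axis=0)

print("more than 100 observations:", enough_rows)
print("more than 5 observations per feature:", enough_rows_per_feature)
print(max_corr_per_feature.round(2))
```

The metric, continuous-or-ordinal and homogeneity assumptions depend on how the data was collected, so they are not covered by this check.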

## Factor loading:

- close to -1 or 1: the factor has a strong influence on the variable
- close to 0: the factor has no or only a weak influence on the variable
- greater than 1: highly correlated variables
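
The -1 to 1 reading is easiest to apply when all features share a scale. scikit-learn's `FactorAnalysis` fits the data in its original units, so a feature with a large variance can get a loading above 1 for that reason alone. The sketch below standardizes the features first. The `StandardScaler` step and `n_components=2` are assumptions made for illustration and are not part of this notebook.

```python
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()

# Assumption: rescale every feature to zero mean and unit variance
X_std = StandardScaler().fit_transform(iris.data)

# Assumption: keep two factors; random_state fixes the randomized SVD
fa = FactorAnalysis(n_components=2, random_state=0).fit(X_std)

# Rows are factors, columns are the original features
loadings = pd.DataFrame(fa.components_, columns=iris.feature_names)
print(loadings.round(2))
```

On standardized data the loadings can be read roughly as correlations between each factor and each feature.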

In [1]:
import numpy as np
import pandas as pd

import sklearn
from sklearn.decomposition import FactorAnalysis
from sklearn import datasets

In [2]:
iris = datasets.load_iris()

X = iris.data
variable_names = iris.feature_names

X[:10,]

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1]])

In [5]:
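# n_components defaults to None, so one factor per feature is fitted
factor = FactorAnalysis().fit(X)

# Each row of components_ is a factor, each column a feature (the loadings)
df = pd.DataFrame(factor.components_, columns=variable_names)
df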

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 0.706989 | -0.158005 | 1.654236 | 0.70085 |
| 1 | 0.115161 | 0.159635 | -0.044321 | -0.01403 |
| 2 | -0.0 | 0.0 | 0.0 | 0.0 |
| 3 | -0.0 | 0.0 | 0.0 | -0.0 |


### Interpretation of the results:

1. Factor 0 is highly influential on sepal length, petal length and petal width.
2. Factors 2 and 3 have no influence and can therefore be removed.
3. Factor 1 has very little influence, so there is not much to interpret here.

We can say that Factor 0 is the latent variable.
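
The same reading of the table can be written as a small check. This sketch copies the loadings printed above and flags the ones whose absolute value passes a cutoff. The cutoff of 0.5 is a hypothetical choice for illustration and does not come from the notebook or from scikit-learn.

```python
import pandas as pd

feature_names = [
    "sepal length (cm)",
    "sepal width (cm)",
    "petal length (cm)",
    "petal width (cm)",
]

# Loadings copied from the table above (rows are factors 0 to 3)
loadings = pd.DataFrame(
    [
        [0.706989, -0.158005, 1.654236, 0.70085],
        [0.115161, 0.159635, -0.044321, -0.01403],
        [0.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0],
    ],
    columns=feature_names,
)

CUTOFF = 0.5  # hypothetical threshold for a "strong" loading

# True where a factor loads strongly on a feature
strong = loadings.abs() >= CUTOFF
for factor_id, row in strong.iterrows():
    influenced = row[row].index.tolist()
    print(f"Factor {factor_id}: {influenced if influenced else 'no strong loadings'}")

# Factors whose loadings are all zero can be dropped
empty_factors = loadings.index[(loadings == 0).all(axis=1)].tolist()
print("Factors with all-zero loadings:", empty_factors)
```

A different cutoff would change which loadings count as strong, so the value should be chosen with the scale of the data in mind.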