## Factor Analysis

**Factor analysis** is a statistical method you can apply to discover *root causes* or *hidden factors* that are present in a dataset but not directly observable.

Factors are also called latent variables. Latent variables are meaningful variables that are inferred rather than directly observed.

**Assumptions:**
- Metric features
- Features are either continuous or ordinal
- Features are correlated with one another (correlation coefficient **r** > 0.3)
- More than 100 observations *and* more than 5 observations per feature
- Sample is homogeneous
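
The sample-size and correlation assumptions can be checked before fitting. The cell below is a minimal sketch on the iris data used later in this notebook. It assumes that the correlation rule means each feature should correlate with at least one other feature at |r| > 0.3, which is one common reading and not the only possible one. The names `features`, `enough_rows`, `max_corr` and `weakly_correlated` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
features = pd.DataFrame(iris.data, columns=iris.feature_names)

n_obs, n_features = features.shape

# Sample-size rules of thumb: more than 100 rows and more than 5 rows per feature
enough_rows = n_obs > 100 and n_obs > 5 * n_features

# Absolute pairwise Pearson correlations between the features
corr = features.corr().abs()

# Mask the diagonal, since every feature correlates perfectly with itself
off_diag = corr.where(~np.eye(n_features, dtype=bool))

# Strongest correlation each feature has with any other feature
max_corr = off_diag.max(axis=0)

# Features that fail the assumed |r| > 0.3 rule
weakly_correlated = max_corr[max_corr <= 0.3].index.tolist()

print("Enough observations:", enough_rows)
print(max_corr.round(2))
print("Weakly correlated features:", weakly_correlated)
```

A non-empty `weakly_correlated` list would point to features that share little variance with the rest and are unlikely to load on a common factor.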

Factor analysis outputs **loadings**.
- Loadings near -1 or 1 mean that the factor strongly influences the variable.
- Loadings close to zero mean that the factor only weakly influences the variable.
- Loadings > 1 mean that these are highly correlated factors.
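
When reading loadings from scikit-learn, note that `components_` is expressed in the units of the input features, so a loading can also exceed 1 simply because a feature has a large variance. The sketch below standardizes the features first so that loadings are easier to compare across variables. `StandardScaler`, `n_components=2` and `random_state=0` are choices made for this illustration and are not part of the original notebook.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

iris = load_iris()

# Standardize so every feature has mean 0 and unit variance (assumption of this sketch)
scaled = StandardScaler().fit_transform(iris.data)

# Two factors are an illustrative choice, not a recommendation
fa = FactorAnalysis(n_components=2, random_state=0)
fa.fit(scaled)

# Rows are factors, columns are the original features
loadings = pd.DataFrame(
    fa.components_,
    columns=iris.feature_names,
    index=["factor 0", "factor 1"],
)

# For each feature, the factor with the largest absolute loading
dominant_factor = loadings.abs().idxmax(axis=0)

print(loadings.round(2))
print(dominant_factor)
```

Reading `loadings` column by column shows which factor each feature depends on most; values near zero in a row mark features that the factor barely influences.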

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pylab import rcParams

import sklearn as sk
from sklearn.datasets import load_iris
from sklearn.decomposition import FactorAnalysis


In [9]:
iris = load_iris()

data = iris.data
target = iris.target

df = pd.DataFrame(data, columns = iris.feature_names)
df['Target'] = target

df

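| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | Target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | 2 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | 2 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | 2 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | 2 |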


In [10]:
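# Fit the model; with the default n_components, one factor is extracted per feature
factor = FactorAnalysis()
factor.fit(data)

# Each row of components_ holds one factor's loadings on the four features
data_fitted = pd.DataFrame(factor.components_, columns=iris.feature_names)
data_fitted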

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 0.706989 | -0.158005 | 1.654236 | 0.70085 |
| 1 | 0.115161 | 0.159635 | -0.044321 | -0.01403 |
| 2 | -0.0 | 0.0 | 0.0 | 0.0 |
| 3 | -0.0 | 0.0 | 0.0 | -0.0 |
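
The last two rows of the loading table are all zeros, which suggests that two factors are enough to describe these four features. As a possible follow-up, the sketch below refits the model with two factors and computes factor scores for each flower. The choice of `n_components=2`, the `random_state=0` seed and the names `fa2` and `scores` are assumptions made for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import FactorAnalysis

iris = load_iris()

# Keep two factors, since the remaining factors had all-zero loadings
fa2 = FactorAnalysis(n_components=2, random_state=0)

# fit_transform returns factor scores: one row per flower, one column per factor
scores = fa2.fit_transform(iris.data)

print(scores.shape)  # (150, 2)

# Per-feature variance that the two factors leave unexplained
print(fa2.noise_variance_)
```

The `scores` array can stand in for the original four columns in later steps, and `noise_variance_` gives a rough sense of how much of each feature the factors fail to capture.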
