# Factor Analysis

## Introduction

https://www.analyticsvidhya.com/blog/2020/10/dimensionality-reduction-using-factor-analysis-in-python/

https://www.datacamp.com/community/tutorials/introduction-factor-analysis

What is Factor Analysis?

>Factor analysis is an unsupervised machine learning technique used for dimensionality reduction. It is a linear statistical model that explains the variance among the observed variables and condenses them into a smaller set of unobserved variables called factors. Observed variables are modeled as linear combinations of factors plus error terms. Each factor, or latent variable, is associated with multiple observed variables that share common patterns of responses, and each factor explains a particular amount of variance in the observed variables. This helps in data interpretation by reducing the number of variables.

>Factor Analysis (FA) is an exploratory data analysis method used to search for influential underlying factors or latent variables in a set of observed variables. It helps in data interpretation by reducing the number of variables. It extracts the maximum common variance from all variables and puts it into a common score.
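
The linear model described above can be sketched with simulated data. The loading matrix, error variances, and sample size below are made-up values chosen only for illustration. Assuming uncorrelated, unit-variance factors, the model implies that the covariance of the observed variables is the loadings times their transpose plus a diagonal matrix of error variances, and the sample covariance should land close to that.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical loadings: 4 observed variables driven by 2 latent factors
loadings = np.array([[0.9, 0.0],
                     [0.8, 0.1],
                     [0.1, 0.7],
                     [0.0, 0.6]])
# Hypothetical error variances, chosen so that each observed variable has variance 1
unique_var = 1.0 - (loadings ** 2).sum(axis=1)

n_samples = 20000
factors = rng.standard_normal((n_samples, 2))                        # latent factors
errors = rng.standard_normal((n_samples, 4)) * np.sqrt(unique_var)   # error terms
observed = factors @ loadings.T + errors                             # observed = loadings * factors + error

# Covariance implied by the model versus the covariance seen in the sample
implied_cov = loadings @ loadings.T + np.diag(unique_var)
sample_cov = np.cov(observed, rowvar=False)
max_gap = np.abs(sample_cov - implied_cov).max()
print("largest absolute gap:", max_gap)
```

The gap shrinks as the sample size grows, which is the sense in which the factors "explain" the correlations among the observed variables.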

What is Factor?

>A factor is a latent variable that describes the association among a number of observed variables. The maximum number of factors is equal to the number of observed variables. Every factor explains a certain amount of variance in the observed variables. The factors that explain the least variance are dropped. Factors are also known as latent variables, hidden variables, unobserved variables, or hypothetical variables.

Uses of Factor Analysis:

>Factor analysis is widely used in market research, advertising, psychology, finance, and operations research. Market researchers use factor analysis to identify price-sensitive customers, to identify brand features that influence consumer choice, and to understand channel selection criteria for the distribution channel.

What are the types of Factor Analysis?

>There are 2 types of Factor Analysis:
- Exploratory Factor Analysis (EFA): It is the most popular factor analysis approach among social and management researchers. Its basic assumption is that any observed variable may be directly associated with any factor.
- Confirmatory Factor Analysis (CFA): Its basic assumption is that each factor is associated with a particular set of observed variables. CFA confirms what is expected on the basis of prior theory.

How does Factor Analysis work?

>The primary objective of factor analysis is to reduce the number of observed variables and find unobservable variables. These unobserved variables help the market researcher to conclude the survey. This conversion of the observed variables to unobserved variables can be achieved in the following steps:

- Factor Extraction: In this step, the number of factors and the extraction approach are selected using variance partitioning methods such as principal components analysis and common factor analysis.
- Factor Rotation: Rotation is a tool for better interpretation of factor analysis. It redistributes the communalities so that the loadings show a clearer pattern. Rotation can be orthogonal, which keeps the factors uncorrelated, or oblique, which allows them to correlate. The main goal of this step is to improve overall interpretability. Many rotation methods are available, such as the Varimax and Quartimax rotation methods (orthogonal) and the Promax rotation method (oblique).
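
To make the rotation step more concrete, here is a minimal NumPy sketch of a varimax rotation using the common SVD-based iteration. It is an illustration only, not the routine used inside `factor_analyzer`. The function name `varimax`, the helper `simplicity`, and the loading values are all assumptions made up for this example.

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-8):
    """Orthogonally rotate a loading matrix towards the varimax criterion."""
    n_vars, n_factors = loadings.shape
    rotation = np.eye(n_factors)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        # Direction in which the varimax criterion increases
        target = rotated ** 3 - rotated @ np.diag((rotated ** 2).sum(axis=0)) / n_vars
        u, s, vt = np.linalg.svd(loadings.T @ target)
        rotation = u @ vt                      # closest orthogonal matrix
        new_criterion = s.sum()
        if criterion != 0.0 and new_criterion < criterion * (1 + tol):
            break                              # no meaningful improvement
        criterion = new_criterion
    return loadings @ rotation, rotation

def simplicity(loading_matrix):
    """Sum over factors of the variance of the squared loadings."""
    return (loading_matrix ** 2).var(axis=0).sum()

# Hypothetical loadings with a clear pattern, deliberately blurred by a 30 degree turn
simple = np.array([[0.8, 0.0], [0.7, 0.0], [0.0, 0.6], [0.0, 0.9]])
angle = np.deg2rad(30)
mix = np.array([[np.cos(angle), -np.sin(angle)],
                [np.sin(angle),  np.cos(angle)]])
unrotated = simple @ mix

rotated, rotation = varimax(unrotated)
print(np.round(rotated, 2))
```

Because the rotation matrix is orthogonal, the row sums of squared loadings (the communalities) are the same before and after rotation. Only the way the variance is spread across the factors changes.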

Steps of Factor Analysis:

- Bartlett’s Test of Sphericity and KMO Test
- Determining the number of factors
- Interpreting the factors

Before proceeding with Factor Analysis, we need to check the following conditions on the data:

- There are no outliers in the data.
- The sample size should be greater than the number of factors.
- There should not be perfect multicollinearity.
- Homoscedasticity between the variables is not required.
- The data should be standard scaled.
- The features have to be numeric.
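
Some of these conditions can be checked programmatically. The sketch below uses only NumPy. The function name `check_preconditions` and the tolerance are assumptions made for illustration, and the outlier check is left out because it depends on the data at hand.

```python
import numpy as np

def check_preconditions(X, tol=1e-8):
    """Return simple diagnostics for a 2-D array with one row per sample."""
    X = np.asarray(X, dtype=float)             # raises if the features are not numeric
    n_samples, n_features = X.shape
    # Standard scaling: zero mean and unit variance for every column
    scaled = (X - X.mean(axis=0)) / X.std(axis=0)
    corr = np.corrcoef(scaled, rowvar=False)
    # Perfect multicollinearity makes the correlation matrix singular,
    # so its smallest eigenvalue is (numerically) zero
    smallest_eig = np.linalg.eigvalsh(corr).min()
    return {
        "more_samples_than_features": n_samples > n_features,
        "perfect_multicollinearity": bool(smallest_eig < tol),
        "scaled": scaled,
    }

rng = np.random.default_rng(0)
independent = rng.standard_normal((200, 3))
# The third column is an exact linear combination of the first two
collinear = np.column_stack([independent[:, 0], independent[:, 1],
                             independent[:, 0] + independent[:, 1]])

ok_report = check_preconditions(independent)
bad_report = check_preconditions(collinear)
print(ok_report["perfect_multicollinearity"], bad_report["perfect_multicollinearity"])
```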

## Case Study

Problem Statement:

>Inner beauty is cherished more than outer appearance. In other words, a person's character is more important than their appearance. Business firms these days take this into consideration and aim at selecting a person rather than just a talented employee. Thus, assessing the personality of a person becomes necessary for firms to increase their productivity. Here, we are given a person's scores on various personality traits, and we try to reduce them and bring the unobserved features or behaviors into consideration. This can be done with the help of dimensionality reduction techniques such as Factor Analysis. The dataset has scores for various personality traits of a person, ranging from 1 to 10. The traits are "distant", "talkatv", "carelss", "hardwrk", "anxious", "agreebl", "tense", "kind", "opposng", "relaxed", "disorgn", "outgoin", "approvn", "shy", "discipl", "harsh", "persevr", "friendl", "worryin", "respnsi", "contrar", "sociabl", "lazy", "coopera", "quiet", "organiz", "criticl", "lax", "laidbck", "withdrw", "givinup", "easygon". The dataset has about 292 instances.

Approach:

>Initially, the data are checked for null values. Next, the data are scaled using standard scaling. Then, the scaled data are passed through tests such as Bartlett’s test of sphericity and the KMO test to determine whether dimensionality reduction techniques such as Factor Analysis can be applied to this dataset. The optimal number of factors is determined with the help of a scree plot. Finally, Factor Analysis is implemented using the `factor_analyzer` package.

## Content

1. <a href = "#1.-Loading-the-Dataset">Loading the Dataset</a>
2. <a href = "#2.-Data-Preprocessing">Data Preprocessing</a>
3. <a href = "#3.-Adequacy-Test">Adequacy Test</a>
    - Bartlett's Test of Sphericity
    - Kaiser-Meyer-Olkin (KMO) Test
4. <a href = "#4.-Determining-the-Number-of-Factors">Determining the Number of Factors</a>
5. <a href = "#5.-Interpreting-the-Factors">Interpreting the Factors</a>
    - Loadings
    - Variance
    - Communalities
6. <a href = "#6.-Inference">Inference</a>
7. <a href = "#7.-Pros-and-Cons-of-Factor-Analysis">Pros and Cons of Factor Analysis</a>
8. <a href = "#8.-Factor-Analysis-Vs.-Principle-Component-Analysis">Factor Analysis Vs. Principal Component Analysis</a>

### Libraries

In [1]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity,calculate_kmo

## 1. Loading the Dataset

In [2]:
pip install factor_analyzer




In [3]:
# Reading the dataset (adjust the path to wherever Standford.txt is stored)
path = os.path.join(os.getcwd(), "Factor Analysis", "Standford.txt")
with open(path, 'r') as fp:
    header = fp.readline()              # first line holds the column names
    data = []
    # As in the original code, skip the line right after the header
    for row in fp.readlines()[1:]:
        data.append(row.split()[1:])    # drop the first field of each row
data = np.array(data, dtype='int')


In [None]:
dataset = pd.DataFrame(data,columns=np.array(header.split(),dtype=object))
dataset.head()

In [None]:
np.array(list(dataset.columns),dtype=object)

[<a href="#Content">Back to Content</a>]

## 2. Data Preprocessing

In [None]:
dataset.isnull().sum()

In [None]:
scaler =  StandardScaler()
dataframe = scaler.fit_transform(dataset)
dataframe = pd.DataFrame(data=dataframe,columns=dataset.columns)
dataframe.head(10)

[<a href="#Content">Back to Content</a>]

## 3. Adequacy Test

The scaled data are passed through tests to determine whether dimensionality reduction techniques such as Factor Analysis can be applied to this dataset. There are two methods to check the factorability or sampling adequacy:
- Bartlett's Test of Sphericity
- Kaiser-Meyer-Olkin Test

### Bartlett's Test of Sphericity

Bartlett’s test checks whether correlation is present in the given data. It tests the null hypothesis (H0) that the correlation matrix is an identity matrix. An identity matrix has 1 for every diagonal element and 0 everywhere else. So, the null hypothesis assumes that no correlation is present among the variables.

We want to reject this null hypothesis because factor analysis aims at explaining the common variance, i.e. the variation due to correlation among the variables. If the p-value is less than 0.05, we can conclude that the correlation matrix is not an identity matrix, i.e. correlation is present among the variables.

In [None]:
chi2,p = calculate_bartlett_sphericity(dataframe)
print("Bartlett Sphericity Test")
print("Chi squared value : ",chi2)
print("p value : ",p)

>Since the p-value is less than 0.05, we can conclude that correlation is present among the variables, which is a green signal to apply factor analysis.
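
For intuition, the chi-squared statistic behind Bartlett's test can be computed by hand from the determinant of the correlation matrix. The sketch below uses simulated data, not the personality dataset, and the function name `bartlett_statistic` is an assumption made for illustration. The p-value itself comes from a chi-squared distribution with the returned degrees of freedom (for example via `scipy.stats.chi2.sf`, which is not used here to keep the example NumPy-only).

```python
import numpy as np

def bartlett_statistic(X):
    """Chi-squared statistic and degrees of freedom for Bartlett's test of sphericity."""
    n, p = X.shape
    corr = np.corrcoef(X, rowvar=False)
    _, logdet = np.linalg.slogdet(corr)        # log of the determinant of the correlation matrix
    chi2 = -(n - 1 - (2 * p + 5) / 6) * logdet
    dof = p * (p - 1) // 2
    return chi2, dof

rng = np.random.default_rng(0)
n = 300
common = rng.standard_normal((n, 1))
# Four variables sharing one common factor versus four unrelated variables
correlated = 0.8 * common + 0.6 * rng.standard_normal((n, 4))
independent = rng.standard_normal((n, 4))

chi2_corr, dof = bartlett_statistic(correlated)
chi2_ind, _ = bartlett_statistic(independent)
print("correlated:", chi2_corr, "independent:", chi2_ind, "dof:", dof)
```

When the variables are correlated, the determinant of the correlation matrix moves towards 0, its logarithm becomes strongly negative, and the statistic grows. For an identity matrix the determinant is 1 and the statistic is 0.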

### Kaiser-Meyer-Olkin (KMO) Test

The KMO test measures the proportion of variance that might be common variance among the variables. Larger proportions are desirable because they indicate that more correlation is present among the variables, which makes dimensionality reduction techniques such as Factor Analysis applicable. The KMO score always lies between 0 and 1, and values above 0.6 are generally considered adequate. It can also be read as a measure of how suited our data are for factor analysis.
Just pass the dataframe that contains the dataset to the `calculate_kmo` function. The function returns the KMO score for each variable, stored in `kmo_all`, and the overall KMO score for the whole data, stored in `kmo_model`.

In [None]:
kmo_all,kmo_model = calculate_kmo(dataset)
print("KMO Test Statistic",kmo_model)

>We can see that our data has an overall KMO score of 0.84. This shows that the variables are strongly correlated and that dimensionality reduction techniques such as factor analysis can be applied.
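
The KMO score compares squared correlations with squared partial correlations. Here is a minimal NumPy sketch of that ratio on simulated data. The function name `kmo_scores` is an assumption for illustration, and the partial correlations are taken from the inverse of the correlation matrix.

```python
import numpy as np

def kmo_scores(X):
    """Return (per-variable KMO, overall KMO) for a 2-D array with one row per sample."""
    corr = np.corrcoef(X, rowvar=False)
    inv_corr = np.linalg.inv(corr)
    # Partial correlations from the inverse correlation matrix
    scale = np.sqrt(np.outer(np.diag(inv_corr), np.diag(inv_corr)))
    partial = -inv_corr / scale
    off_diag = ~np.eye(corr.shape[0], dtype=bool)
    corr_sq = np.where(off_diag, corr ** 2, 0.0)
    partial_sq = np.where(off_diag, partial ** 2, 0.0)
    per_variable = corr_sq.sum(axis=0) / (corr_sq.sum(axis=0) + partial_sq.sum(axis=0))
    overall = corr_sq.sum() / (corr_sq.sum() + partial_sq.sum())
    return per_variable, overall

rng = np.random.default_rng(0)
n = 500
common = rng.standard_normal((n, 1))
# Simulated data: four variables that share one common factor
X = 0.8 * common + 0.6 * rng.standard_normal((n, 4))

kmo_per_variable, kmo_overall = kmo_scores(X)
print(kmo_per_variable, kmo_overall)
```

When the correlations are mostly due to shared factors, the partial correlations are small and the score approaches 1.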

[<a href="#Content">Back to Content</a>]

## 4. Determining the Number of Factors

The maximum number of factors is equal to the number of variables in our dataset. Not all of these factors will provide a significant amount of useful information about the common variance among the variables, so we have to decide how many factors to keep. The number of factors can be decided on the basis of the amount of common variance the factors explain. In general, we plot the factors against their eigenvalues.

**Eigenvalues** are the amount of variance each factor explains out of the total variance. We select the factors whose eigenvalues are greater than 1. Eigenvalues are also known as characteristic roots.

But why should we choose the factors whose eigenvalues are greater than 1? 

>In a standard normal distribution with mean 0 and Standard deviation 1, the variance will be 1. Since we have standard scaled the data the variance of a feature is 1. This is the reason for selecting factors whose eigenvalues(variance) are greater than 1 i.e. the factors which explain more variance than a single observed variable.

In [None]:
fa = FactorAnalyzer(rotation = None,impute = "drop",n_factors=dataframe.shape[1])

In [None]:
fa.fit(dataframe)

In [None]:
ev,_ = fa.get_eigenvalues()

In [None]:
plt.scatter(range(1,dataframe.shape[1]+1),ev)
plt.plot(range(1,dataframe.shape[1]+1),ev)
plt.title('Scree Plot')
plt.xlabel('Factors')
plt.ylabel('Eigen Value')
plt.grid()

>The `get_eigenvalues` function returns the original eigenvalues and the common factor eigenvalues. Here, we consider only the original eigenvalues. From the graph, we can see that the eigenvalues drop below 1 from the 7th factor onwards. So, the optimal number of factors is 6.
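
The "eigenvalue greater than 1" rule can be reproduced with NumPy alone by taking the eigenvalues of the correlation matrix. The data below are simulated from two made-up factors, so the loadings and sample size are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Hypothetical structure: variables 0-2 load on factor 0, variables 3-5 on factor 1
true_loadings = np.array([[0.8, 0.0], [0.8, 0.0], [0.8, 0.0],
                          [0.0, 0.8], [0.0, 0.8], [0.0, 0.8]])
factors = rng.standard_normal((n, 2))
X = factors @ true_loadings.T + 0.6 * rng.standard_normal((n, 6))

corr = np.corrcoef(X, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]   # largest first

# Keep the factors that explain more variance than a single standardized variable
n_factors = int((eigenvalues > 1).sum())
print(np.round(eigenvalues, 2), n_factors)
```

The eigenvalues of a correlation matrix always sum to the number of variables, which is why 1 is the natural threshold: it is the variance contributed by one standardized variable.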

[<a href="#Content">Back to Content</a>]

## 5. Interpreting the Factors

Fit the model with the optimal number of factors, i.e. 6 in our case. Then, we have to interpret the factors by making use of:
- Loadings
- Variance
- Communalities

### Loadings

The factor loadings form a matrix that shows the relationship of each variable to the underlying factors. Each loading is the correlation coefficient between an observed variable and a factor, and its square is the share of that variable's variance explained by the factor. Loadings range from -1 to 1. Values close to -1 or 1 indicate that the factor strongly influences the variable. Values close to 0 indicate that the factor has little influence on the variable.

In [None]:
fa = FactorAnalyzer(n_factors=6,rotation='varimax')
fa.fit(dataset)

In [None]:
print(pd.DataFrame(fa.loadings_,index=dataframe.columns))

>In Factor 0, we can see that the features ‘distant’ and ‘shy’ have higher loadings than the other variables. From this, we can see that Factor 0 explains the common variance in people who are reserved, i.e. the variance among the people who are distant and shy.

### Variance

The amount of variance explained by each factor can be found out using the ‘get_factor_variance’ function. 

In [None]:
print(pd.DataFrame(fa.get_factor_variance(),index=['Variance','Proportional Var','Cumulative Var']))

>The first row represents the variance explained by each factor. Proportional variance is the variance explained by a factor out of the total variance. Cumulative variance is nothing but the cumulative sum of proportional variances of each factor. In our case, the 6 factors together are able to explain 55.3% of the total variance.
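
The three rows can be reproduced from a loading matrix by hand. The loadings below are hypothetical numbers, not the output of the model fitted above.

```python
import numpy as np

# Hypothetical loadings: 4 variables (rows) on 2 factors (columns)
loadings = np.array([[0.8, 0.1],
                     [0.7, 0.2],
                     [0.1, 0.6],
                     [0.2, 0.9]])

variance = (loadings ** 2).sum(axis=0)          # sum of squared loadings per factor
proportional = variance / loadings.shape[0]     # share of the total variance (one unit per variable)
cumulative = np.cumsum(proportional)            # running total across factors

print(variance, proportional, cumulative)
```

Because each standardized variable contributes one unit of variance, dividing by the number of variables turns the sum of squared loadings into a proportion of the total variance.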

### Communalities

Communality is the sum of the squared loadings for each variable. It represents the common variance, i.e. the proportion of each variable’s variance that can be explained by the factors. It ranges from 0 to 1, and values close to 1 mean that more of the variable's variance is explained. Rotations don’t have any influence over the communality of the variables.

In [None]:
print(pd.DataFrame(fa.get_communalities(),index=dataframe.columns,columns=['Communalities']))

>The proportion of each variable’s variance that is explained by the factors can be inferred from the above. For example, for the variable ‘talkatv’, about 62.9% of its variance is explained by all the factors together.
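
As a small worked example, communalities are the row-wise sums of squared loadings. The loading matrix below is hypothetical and only serves to show the arithmetic.

```python
import numpy as np

# Hypothetical loadings: 4 variables (rows) on 2 factors (columns)
loadings = np.array([[0.8, 0.1],
                     [0.7, 0.2],
                     [0.1, 0.6],
                     [0.2, 0.9]])

communalities = (loadings ** 2).sum(axis=1)   # variance of each variable explained by the factors
uniqueness = 1.0 - communalities              # variance left unexplained for each variable

print(communalities, uniqueness)
```

Summing the communalities over all variables gives the same total as summing the per-factor variances, since both add up every squared loading once.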

[<a href="#Content">Back to Content</a>]

## 6. Inference

Bartlett’s test of sphericity gave a p-value of 0.0, below the 0.05 significance level, which indicates
that the correlation matrix is not an identity matrix, i.e. correlation is present among the variables.
The overall KMO statistic is 0.88, which indicates that the sampling is adequate and thus supports
applying factor analysis. The optimal number of factors is 6, as their eigenvalues are
above 1, which can also be inferred from the scree plot. Factor analysis creates factors that
explain the variance due to correlation among the variables. The 6 factors
cumulatively explain about 55% of the total variance, with the first factor leading at
about 14%. Thus, factor analysis has helped us reduce the dimensions
by introducing factors, where each factor helps explain the variance due to the
correlation among the variables.

[<a href="#Content">Back to Content</a>]

## 7. Pros and Cons of Factor Analysis

Factor analysis explores large datasets and finds interlinked associations. It reduces the observed variables into a few unobserved variables or identifies groups of inter-related variables, which helps market researchers to summarize market situations and find hidden relationships among consumer taste, preference, and cultural influence. It also helps in improving questionnaires for future surveys. Factors make for more natural data interpretation.

[<a href="#Content">Back to Content</a>]

<a id="8.-Factor-Analysis-Vs.-Principle-Component-Analysis"></a>
## 8. Factor Analysis Vs. Principal Component Analysis

- PCA components explain the maximum amount of variance, while factor analysis explains the covariance in the data.
- PCA components are fully orthogonal to each other, whereas factor analysis does not require factors to be orthogonal.
- A PCA component is a linear combination of the observed variables, while in FA the observed variables are linear combinations of the unobserved variables or factors.
- PCA components are often hard to interpret. In FA, the underlying factors are labelable and interpretable.
- PCA is a dimensionality reduction method, whereas factor analysis is a latent variable method.
- PCA is sometimes described as a type of factor analysis. PCA is observational, whereas FA is a modeling technique.
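
The PCA side of this comparison can be illustrated with a few lines of NumPy. The data are simulated, and the mixing matrix is an arbitrary assumption. The sketch shows that each component is a linear combination of the observed variables, that the component directions are orthogonal, and that the component scores are uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(2)
# Simulated correlated data: 300 samples, 3 variables
mixing = np.array([[1.0, 0.5, 0.2],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 1.0]])
X = rng.standard_normal((300, 3)) @ mixing
Z = (X - X.mean(axis=0)) / X.std(axis=0)        # standard scaling

corr = np.corrcoef(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1]                # sort components by variance explained
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Each component score is a linear combination of the observed (scaled) variables
scores = Z @ eigvecs
score_cov = np.cov(scores, rowvar=False, ddof=0)

print(np.round(eigvecs.T @ eigvecs, 6))          # orthogonal component directions
print(np.round(score_cov, 6))                    # diagonal: scores are uncorrelated
```

In factor analysis the direction of the equation is reversed: the observed variables are written as combinations of the factors plus an error term, and after an oblique rotation the factors need not be uncorrelated.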

[<a href="#Content">Back to Content</a>]

In [None]:
# Found errors: dataset is not available