In this session we will see how to:

1) Implement Exploratory Factor Analysis in Python
2) Test the factorability of the data
3) Choose the right number of factors

In the last session we saw that if the input variables are highly correlated, the model can be biased towards those variables. One simple fix is to keep only one input variable from each group of highly correlated variables, but this leads to a loss of information. Factor analysis avoids that loss by linearly combining the highly correlated variables into factors.

Each factor is a linear combination of highly correlated input variables, so it can be written in the same form as a linear regression equation.
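
As a rough illustration of this idea, the sketch below builds two highly correlated variables from a shared signal and combines them into one factor with a weighted sum. The synthetic data, the variable names and the equal weights of 0.5 are assumptions made for this example only; FactorAnalyzer estimates its own weights from the data.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


random.seed(0)

# Hypothetical shared signal that drives both input variables
shared = [random.gauss(0, 1) for _ in range(1000)]

# Two input variables: the shared signal plus a little independent noise
x1 = [s + 0.3 * random.gauss(0, 1) for s in shared]
x2 = [s + 0.3 * random.gauss(0, 1) for s in shared]

# Assumed weights, chosen by hand for illustration
w1, w2 = 0.5, 0.5

# The factor is a linear combination of the correlated inputs
factor = [w1 * a + w2 * b for a, b in zip(x1, x2)]

print("corr(x1, x2):    ", round(pearson(x1, x2), 3))
print("corr(factor, x1):", round(pearson(factor, x1), 3))
print("corr(factor, x2):", round(pearson(factor, x2), 3))
```

Because both inputs carry almost the same information, the single combined column stays strongly correlated with each of them, which is why replacing the group with one factor loses little information.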

NOTE: For more information, refer to the PDF file.

We will be using a data set of individuals who took a web-based personality assessment test. Each individual was rated on 25 different personality assessment items, which are the input variables. The items are labeled A1, A2, A3 ... O5. The ratings are on a scale of 1 to 6, where:

1 - Very Inaccurate
2 - Moderately Inaccurate
3 - Slightly Inaccurate
4 - Slightly Accurate
5 - Moderately Accurate
6 - Very Accurate

To perform the factor analysis, you need to install the factor_analyzer package (for example, with pip install factor_analyzer).

In [1]:
# Let's import the libraries and functions we need

import pandas as pd
from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import calculate_kmo

# Import the data set

df = pd.read_csv(r"C:\Users\arun\Downloads\Factor Analysis - Data Set.csv")

df.head()

Unnamed: 0,A1,A2,A3,A4,A5,C1,C2,C3,C4,C5,...,N4,N5,O1,O2,O3,O4,O5,gender,education,age
0,2.0,4.0,3.0,4.0,4.0,2.0,3.0,3.0,4.0,4.0,...,2.0,3.0,3.0,6,3.0,4.0,3.0,1,,16
1,2.0,4.0,5.0,2.0,5.0,5.0,4.0,4.0,3.0,4.0,...,5.0,5.0,4.0,2,4.0,3.0,3.0,2,,18
2,5.0,4.0,5.0,4.0,4.0,4.0,5.0,4.0,2.0,5.0,...,2.0,3.0,4.0,2,5.0,5.0,2.0,2,,17
3,4.0,4.0,6.0,5.0,5.0,4.0,4.0,3.0,5.0,5.0,...,4.0,1.0,3.0,3,4.0,3.0,5.0,2,,17
4,2.0,3.0,3.0,4.0,5.0,4.0,4.0,5.0,3.0,2.0,...,4.0,3.0,3.0,3,4.0,3.0,3.0,1,,17


In [2]:
# Let's find the number of rows and columns in the data set

df.shape

# Let's remove the gender, education and age columns

df_new = df.drop(columns=['gender', 'education', 'age'])

df_new.head()

Unnamed: 0,A1,A2,A3,A4,A5,C1,C2,C3,C4,C5,...,N1,N2,N3,N4,N5,O1,O2,O3,O4,O5
0,2.0,4.0,3.0,4.0,4.0,2.0,3.0,3.0,4.0,4.0,...,3.0,4.0,2.0,2.0,3.0,3.0,6,3.0,4.0,3.0
1,2.0,4.0,5.0,2.0,5.0,5.0,4.0,4.0,3.0,4.0,...,3.0,3.0,3.0,5.0,5.0,4.0,2,4.0,3.0,3.0
2,5.0,4.0,5.0,4.0,4.0,4.0,5.0,4.0,2.0,5.0,...,4.0,5.0,4.0,2.0,3.0,4.0,2,5.0,5.0,2.0
3,4.0,4.0,6.0,5.0,5.0,4.0,4.0,3.0,5.0,5.0,...,2.0,5.0,2.0,4.0,1.0,3.0,3,4.0,3.0,5.0
4,2.0,3.0,3.0,4.0,5.0,4.0,4.0,5.0,3.0,2.0,...,2.0,3.0,4.0,4.0,3.0,3.0,3,4.0,3.0,3.0


In [3]:
# Let's check for missing values in each column

df_new.isnull().sum()

# Let's see how many rows remain if we drop the rows with missing values

df_final = df_new.dropna()

df_final.shape

# Dropping the missing values removes fewer than 400 rows, which still leaves 2400+ rows.

(2436, 25)

Factorability: This asks whether there are patterns in the data, such as correlations, that can help us find useful factors in the data set. It can be checked with the Kaiser-Meyer-Olkin test (KMO test), which estimates the proportion of variance among the variables that may be common variance. In simple terms, the closer the KMO value is to 1, the more useful factor analysis will be for the data; the closer it is to 0, the less useful it will be.

In most cases, a KMO value of less than 0.50 indicates that applying factor analysis to the data will be of no use.

In [4]:
# Let's find the KMO test result

kmo_all, kmo_model = calculate_kmo(df_final)

print(kmo_model)

0.8485397221949222
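
To make the KMO value less of a black box, here is a minimal sketch of how the overall statistic can be computed by hand from correlations and partial correlations. It follows the standard KMO definition and is not taken from the factor_analyzer source; the function name kmo_overall and both synthetic data sets are assumptions made for this example.

```python
import numpy as np


def kmo_overall(data):
    """Overall KMO: squared correlations / (squared correlations + squared partial correlations)."""
    corr = np.corrcoef(data, rowvar=False)
    inv_corr = np.linalg.inv(corr)

    # Partial correlations come from the scaled, negated inverse correlation matrix
    scale = np.sqrt(np.outer(np.diag(inv_corr), np.diag(inv_corr)))
    partial = -inv_corr / scale

    # Only the off-diagonal entries enter the statistic
    off_diag = ~np.eye(corr.shape[0], dtype=bool)
    r2 = np.sum(corr[off_diag] ** 2)
    p2 = np.sum(partial[off_diag] ** 2)
    return float(r2 / (r2 + p2))


rng = np.random.default_rng(0)
n_rows, n_cols = 500, 6

# Synthetic columns that all share one latent signal (good candidate for factor analysis)
latent = rng.normal(size=(n_rows, 1))
shared = latent + 0.6 * rng.normal(size=(n_rows, n_cols))

# Synthetic columns of pure independent noise (poor candidate)
noise = rng.normal(size=(n_rows, n_cols))

kmo_shared = kmo_overall(shared)
kmo_noise = kmo_overall(noise)

print("KMO, shared signal:", round(kmo_shared, 3))
print("KMO, pure noise:   ", round(kmo_noise, 3))
```

The columns that share a common signal should score close to 1, while the unrelated columns should score much lower, which matches the interpretation given above.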


Choosing the number of factors: A common rule, known as the Kaiser criterion, is to keep the factors whose eigenvalues are greater than 1. We first calculate the eigenvalues for all the factors (the number of factors is equal to the number of columns in the data set), and then we count only those factors whose eigenvalue is greater than 1.

In [5]:
# Let's create 25 factors to find the right number of factors into which this data can be divided

# STEP 1: Create a factor object
# Syntax - FactorAnalyzer(n_factors = number of factors, rotation = None)

fa = FactorAnalyzer(n_factors=25, rotation=None)

# STEP 2: Fit the object on the data frame

fa.fit(df_final)

# STEP 3: Find the eigenvalues using get_eigenvalues()
# It returns the original eigenvalues and the common-factor eigenvalues (not eigenvectors)

eigen_values, common_factor_eigen_values = fa.get_eigenvalues()

eigen_values

array([5.13431118, 2.75188667, 2.14270195, 1.85232761, 1.54816285,
       1.07358247, 0.83953893, 0.79920618, 0.71898919, 0.68808879,
       0.67637336, 0.65179984, 0.62325295, 0.59656284, 0.56309083,
       0.54330533, 0.51451752, 0.49450315, 0.48263952, 0.448921  ,
       0.42336611, 0.40067145, 0.38780448, 0.38185679, 0.26253902])

From the eigenvalues above, we count how many are greater than 1. That count is the optimal number of factors into which this data can be divided. Since 6 eigenvalues are greater than 1, the columns of this data can be grouped into 6 factors.
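
The count can also be done in code instead of by eye. The sketch below copies the eigenvalues printed above into a plain list and applies the greater-than-1 rule; the variable names n_factors and explained_share are introduced here for illustration, and the share of the eigenvalue total is an extra summary that the session itself does not compute.

```python
# Eigenvalues copied from the output of get_eigenvalues() above
eigen_values = [
    5.13431118, 2.75188667, 2.14270195, 1.85232761, 1.54816285,
    1.07358247, 0.83953893, 0.79920618, 0.71898919, 0.68808879,
    0.67637336, 0.65179984, 0.62325295, 0.59656284, 0.56309083,
    0.54330533, 0.51451752, 0.49450315, 0.48263952, 0.448921,
    0.42336611, 0.40067145, 0.38780448, 0.38185679, 0.26253902,
]

# Keep only the eigenvalues greater than 1
kept = [value for value in eigen_values if value > 1]
n_factors = len(kept)

# Share of the eigenvalue total carried by the kept factors
explained_share = sum(kept) / sum(eigen_values)

print("Number of factors:", n_factors)
print("Share of total eigenvalue sum:", round(explained_share, 2))
```

In the notebook itself, the same count could be taken directly from the array, for example with (eigen_values > 1).sum().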

In [6]:
# Let's create the final factor model with n_factors = 6

fa_final = FactorAnalyzer(n_factors=6, rotation='varimax')

# Fit the object

fa_final.fit(df_final)

# Print the factor loadings - how strongly each column is related to each factor

fa_final.loadings_

# Let's convert the loadings to a data frame

loading_df = pd.DataFrame(fa_final.loadings_, columns=['Factor1', 'Factor2', 'Factor3', 'Factor4', 'Factor5', 'Factor6'],
                         index=df_final.columns)

loading_df


Unnamed: 0,Factor1,Factor2,Factor3,Factor4,Factor5,Factor6
A1,0.09522,0.040783,0.048734,-0.530987,-0.113057,0.161216
A2,0.033131,0.235538,0.133714,0.661141,0.063734,-0.006244
A3,-0.009621,0.343008,0.121353,0.605933,0.03399,0.160106
A4,-0.081518,0.219717,0.23514,0.404594,-0.125338,0.086356
A5,-0.149616,0.414458,0.106382,0.469698,0.030977,0.236519
C1,-0.004358,0.077248,0.554582,0.007511,0.190124,0.095035
C2,0.06833,0.03837,0.674545,0.057055,0.087593,0.152775
C3,-0.039994,0.031867,0.551164,0.101282,-0.011338,0.008996
C4,0.216283,-0.066241,-0.638475,-0.102617,-0.143846,0.318359
C5,0.284187,-0.180812,-0.544838,-0.059955,0.025837,0.132423
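
One way to read a loading table is to assign each variable to the factor on which it has the largest absolute loading. The sketch below does this for the ten rows shown above only; the remaining rows are not visible in the output, so they are left out. The names loadings and dominant are introduced for this example.

```python
factor_names = ["Factor1", "Factor2", "Factor3", "Factor4", "Factor5", "Factor6"]

# Loadings copied from the ten rows of loading_df shown above
loadings = {
    "A1": [0.09522, 0.040783, 0.048734, -0.530987, -0.113057, 0.161216],
    "A2": [0.033131, 0.235538, 0.133714, 0.661141, 0.063734, -0.006244],
    "A3": [-0.009621, 0.343008, 0.121353, 0.605933, 0.03399, 0.160106],
    "A4": [-0.081518, 0.219717, 0.23514, 0.404594, -0.125338, 0.086356],
    "A5": [-0.149616, 0.414458, 0.106382, 0.469698, 0.030977, 0.236519],
    "C1": [-0.004358, 0.077248, 0.554582, 0.007511, 0.190124, 0.095035],
    "C2": [0.06833, 0.03837, 0.674545, 0.057055, 0.087593, 0.152775],
    "C3": [-0.039994, 0.031867, 0.551164, 0.101282, -0.011338, 0.008996],
    "C4": [0.216283, -0.066241, -0.638475, -0.102617, -0.143846, 0.318359],
    "C5": [0.284187, -0.180812, -0.544838, -0.059955, 0.025837, 0.132423],
}

# For each variable, pick the factor with the largest absolute loading
dominant = {}
for variable, row in loadings.items():
    best = max(range(len(row)), key=lambda i: abs(row[i]))
    dominant[variable] = (factor_names[best], row[best])

for variable, (factor, value) in dominant.items():
    print(variable, "->", factor, "loading", value)
```

In these rows the A items group on Factor4 and the C items on Factor3. A negative loading, as for A1, C4 and C5, means the item moves in the opposite direction to the factor.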


# Let's create the correlation matrix of the factor loadings

# We will continue in the next class