# <b>Factor Analysis</b>

Factors: linear combinations of the original variables.

Factor analysis is the process of combining highly correlated variables into a smaller number of factors.

In factor analysis, features that are correlated with each other are combined into a shared factor.


## Objective:
- Reduce the number of variables (decrease dimensionality)
- Examine relationships between variables
- Address the problem of multicollinearity

## Assumption:
- Normalized data
- Factors are independent of each other
- There exist some underlying factors that can describe the original variables
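
The assumptions above match the standard linear factor model. As a sketch, in notation that is not part of the original notes (the symbols $x$, $\mu$, $\Lambda$, $f$, $\varepsilon$ and $\Psi$ are chosen here for illustration), each observation of $p$ variables is written in terms of $k < p$ factors:

$$
x = \mu + \Lambda f + \varepsilon, \qquad f \sim \mathcal{N}(0, I_k), \qquad \varepsilon \sim \mathcal{N}(0, \Psi), \quad \Psi \text{ diagonal}
$$

With $f$ and $\varepsilon$ independent of each other, the covariance of the observed variables is:

$$
\operatorname{Cov}(x) = \Lambda \Lambda^{\top} + \Psi
$$

Here $\Lambda$ (of size $p \times k$) holds the factor loadings, and the diagonal $\Psi$ encodes error terms that are independent of one another.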

## Types of FA:
<b>1) Exploratory FA:</b>
- Identify relationships among variables
- Group variables that are part of a similar concept
- No prior assumption about the number of factors or the relationships among them

<b>2) Confirmatory FA:</b>
- Assumes the number of factors in advance: we hypothesize that N factors can represent or describe the data.
- Tests the hypothesis that the variables are associated with those N specific factors.

## Steps:
1) Validate the data: 
    - Sample size: more than 1000 samples is considered excellent
    - Sample-to-variable ratio: this ratio should be at least about 15:1
    - Correlation between variables: if the correlations are very low (< 0.3), FA might not be a good technique

2) Extract the factors:
    - Assumptions:
        - Error terms are independent of one another
        - Factors are independent of one another and of the error terms

3)


## Factor Loadings:
- Represent the relationship of each variable with the underlying factors
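
As a minimal sketch of what loadings look like in practice, the example below fits scikit-learn's `FactorAnalysis` on synthetic data and reads the loadings from its `components_` attribute. The data, the `true_loadings` matrix and all variable names are hypothetical and only serve as illustration. The estimated loadings are identified only up to rotation and sign, so they need not match `true_loadings` entry by entry.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

# Hypothetical data: 500 samples of 6 variables driven by 2 latent factors
n_samples, n_factors = 500, 2
latent = rng.normal(size=(n_samples, n_factors))
true_loadings = np.array([
    [0.9, 0.0],
    [0.8, 0.0],
    [0.7, 0.0],
    [0.0, 0.9],
    [0.0, 0.8],
    [0.0, 0.7],
])
noise = 0.3 * rng.normal(size=(n_samples, true_loadings.shape[0]))
X = latent @ true_loadings.T + noise

fa_demo = FactorAnalysis(n_components=n_factors, random_state=0)
fa_demo.fit(X)

# components_ has shape (n_components, n_features);
# transposing gives one row per variable and one column per factor
loadings = fa_demo.components_.T
print(loadings.shape)
print(np.round(loadings, 2))
```

A large absolute value in row i, column j suggests that variable i is strongly related to factor j.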



In [84]:
# import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.decomposition import FactorAnalysis

In [85]:
train = pd.read_csv('data.csv')

In [86]:
train.head()

| | Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FDA15 | 9.3 | Low Fat | 0.016047 | Dairy | 249.8092 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 3735.138 |
| 1 | DRC01 | 5.92 | Regular | 0.019278 | Soft Drinks | 48.2692 | OUT018 | 2009 | Medium | Tier 3 | Supermarket Type2 | 443.4228 |
| 2 | FDN15 | 17.5 | Low Fat | 0.01676 | Meat | 141.618 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 2097.27 |
| 3 | FDX07 | 19.2 | Regular | 0.0 | Fruits and Vegetables | 182.095 | OUT010 | 1998 | NaN | Tier 3 | Grocery Store | 732.38 |
| 4 | NCD19 | 8.93 | Low Fat | 0.0 | Household | 53.8614 | OUT013 | 1987 | High | Tier 3 | Supermarket Type1 | 994.7052 |


In [87]:
train.isnull().sum()

Item_Identifier                 0
Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64

In [88]:
train.Item_Weight.median()

12.6

In [89]:
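# fill missing values: mode for the categorical column, median for the numeric one
# (assigning back avoids relying on inplace=True through attribute access)
train["Outlet_Size"] = train["Outlet_Size"].fillna(train["Outlet_Size"].mode()[0])
train["Item_Weight"] = train["Item_Weight"].fillna(train["Item_Weight"].median())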


In [90]:
train.shape

(8523, 12)

In [92]:
train_updated = train.drop(["Item_Identifier", "Outlet_Identifier"], axis=1)

In [93]:
train_updated.nunique()

Item_Weight                   415
Item_Fat_Content                5
Item_Visibility              7880
Item_Type                      16
Item_MRP                     5938
Outlet_Establishment_Year       9
Outlet_Size                     3
Outlet_Location_Type            3
Outlet_Type                     4
Item_Outlet_Sales            3493
dtype: int64

In [94]:
# one-hot encode the categorical columns so that all features are numeric
train_updated = pd.get_dummies(train_updated)

In [95]:
train_updated.shape

(8523, 36)
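
The jump from 10 to 36 columns follows from the `nunique()` output above: the 5 numeric columns are kept as they are, and the 5 categorical columns expand into 5 + 16 + 3 + 3 + 4 = 31 indicator columns. A minimal sketch of this behaviour on a tiny, made-up frame (the values are hypothetical; only the column names are borrowed from the dataset):

```python
import pandas as pd

# Hypothetical three-row frame with one numeric and one categorical column
demo = pd.DataFrame({
    "Item_MRP": [249.8, 48.3, 141.6],
    "Outlet_Size": ["Medium", "Medium", "High"],
})

# Numeric columns pass through unchanged; each category becomes its own indicator column
encoded = pd.get_dummies(demo)
print(encoded.columns.tolist())
print(encoded.shape)
```

The categorical column with two distinct values is replaced by two indicator columns, so the frame goes from 2 to 3 columns.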

In [122]:
# separating the target variable
df = train_updated.drop('Item_Outlet_Sales', axis=1)
target = train_updated['Item_Outlet_Sales']

In [123]:
# creating the training and validation set
X_train, X_valid, y_train, y_valid = train_test_split(df, target, random_state = 10, test_size = 0.25)

## Checking the assumptions of applying Factor Analysis

Sample size: more than 1000 samples is considered excellent.

In [124]:
X_train.shape[0]

6392

Sample-to-variable ratio: this ratio should be at least about 15:1.

In [125]:
X_train.shape[0] / X_train.shape[1]

182.62857142857143

Correlation between variables: if the correlations are very low, FA might not be a good technique. Here, pairs with an absolute correlation of at least 0.5 are counted as highly correlated.

In [130]:
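# absolute pairwise correlations between the training features
cor = X_train.corr().abs()

# flatten the matrix to one entry per ordered pair and sort from highest to lowest
s = cor.unstack()
so = s.sort_values(ascending=False)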

In [131]:
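# count entries with correlation in [0.5, 1);
# the matrix is symmetric, so each pair of variables is counted twice
count = 0
for i in so:
    if 0.5 <= i < 1:
        count += 1

count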

18
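
Because `unstack()` on a symmetric correlation matrix lists every pair twice (once as a-b and once as b-a), the 18 entries above correspond to 9 distinct pairs. A minimal sketch of counting each pair only once, using the strict upper triangle of the matrix; the data frame `demo` and its columns are hypothetical:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)

# Hypothetical data: "b" is a noisy copy of "a", "c" is unrelated to both
demo = pd.DataFrame({
    "a": base,
    "b": base + 0.1 * rng.normal(size=200),
    "c": rng.normal(size=200),
})

cor_demo = demo.corr().abs()
values = cor_demo.to_numpy()

# Indices of the strict upper triangle: each pair once, diagonal excluded
rows, cols = np.triu_indices_from(values, k=1)
strong = values[rows, cols] >= 0.5

strong_pairs = [
    (cor_demo.index[r], cor_demo.columns[c])
    for r, c in zip(rows[strong], cols[strong])
]
print(len(strong_pairs), strong_pairs)
```

This also avoids the `i < 1` check, which is only there to skip the diagonal.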

In [132]:
# creating the random forest regressor model
model = RandomForestRegressor(random_state=1, max_depth=3, n_estimators=100)

In [133]:
model.fit(X_train, y_train)

RandomForestRegressor(max_depth=3, random_state=1)

In [135]:
pred1 = model.predict(X_valid)

In [140]:
# RMSE on the validation set and on the training set
np.sqrt(mean_squared_error(y_valid, pred1)), np.sqrt(mean_squared_error(y_train, model.predict(X_train)))

(1178.0230078335858, 1144.2150863170398)

## Factor Analysis

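The count of 18 above includes each pair twice, so there are 9 pairs of highly correlated features (absolute correlation of at least 0.5). We therefore fit a factor analysis model with 9 factors.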

In [141]:
fa = FactorAnalysis(n_components=9)

In [142]:
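X_train_transformed = fa.fit_transform(X_train)
# reuse the factors fitted on the training data;
# calling fit_transform on X_valid would learn a different set of factors
X_valid_transformed = fa.transform(X_valid)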

In [144]:
model = RandomForestRegressor(random_state=1, max_depth=3, n_estimators=100)

In [146]:
model.fit(X_train_transformed, y_train)

RandomForestRegressor(max_depth=3, random_state=1)

In [148]:
# RMSE on the validation set and on the training set, using the factor scores as features
np.sqrt(mean_squared_error(y_valid, model.predict(X_valid_transformed))), np.sqrt(mean_squared_error(y_train, model.predict(X_train_transformed)))

(1169.664223904249, 1137.4631254119918)

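In the recorded run, both errors decreased slightly: the validation RMSE went from about 1178.0 to 1169.7 and the training RMSE from about 1144.2 to 1137.5. The improvement is under 1%, so the main result is comparable accuracy with 9 factors instead of the 35 original features.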
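
One way to keep the factor extraction and the regressor in step is to wrap them in a scikit-learn pipeline, so that the factors are learned during `fit` and only applied during `predict`. The sketch below is an illustration on synthetic data, not the notebook's own method; `X_demo`, `y_demo` and the choice of 4 factors are assumptions made for the example.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)

# Hypothetical regression data: 400 samples, 12 features
X_demo = rng.normal(size=(400, 12))
y_demo = 2.0 * X_demo[:, 0] + X_demo[:, 1] + 0.1 * rng.normal(size=400)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=10
)

# fit() learns the factors and the forest on the training split only;
# predict() transforms new data with those same factors before predicting
pipe = make_pipeline(
    FactorAnalysis(n_components=4, random_state=0),
    RandomForestRegressor(random_state=1, max_depth=3, n_estimators=100),
)
pipe.fit(X_tr, y_tr)

pred_va = pipe.predict(X_va)
rmse_va = np.sqrt(mean_squared_error(y_va, pred_va))
print(rmse_va)
```

With this layout there is no separate transformed copy of the validation set to keep consistent by hand.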

In [149]:
# correlation between transformed variables
pd.DataFrame(X_train_transformed).corr()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | -2.022603e-16 | 5.192967e-16 | -4.904165e-16 | -1.126949e-14 | -3.390829e-15 | 8.005896e-15 | 1.092147e-14 | -2.695476e-13 |
| 1 | -2.022603e-16 | 1.0 | 7.188844e-16 | 3.179326e-16 | -1.029723e-14 | 3.72332e-15 | -2.561599e-14 | -5.736207e-14 | 1.974648e-12 |
| 2 | 5.192967e-16 | 7.188844e-16 | 1.0 | 1.368673e-17 | 2.790775e-14 | 4.186363e-14 | -1.086952e-13 | 7.818612e-14 | -5.862706e-12 |
| 3 | -4.904165e-16 | 3.179326e-16 | 1.368673e-17 | 1.0 | -3.485096e-14 | -1.441964e-13 | -3.057529e-13 | 1.05194e-13 | -3.969464e-12 |
| 4 | -1.126949e-14 | -1.029723e-14 | 2.790775e-14 | -3.485096e-14 | 1.0 | 1.310354e-14 | 6.216524e-16 | -8.080253e-17 | 8.379436e-16 |
| 5 | -3.390829e-15 | 3.72332e-15 | 4.186363e-14 | -1.441964e-13 | 1.310354e-14 | 1.0 | 4.679221e-15 | -6.886128e-15 | -1.162077e-13 |
| 6 | 8.005896e-15 | -2.561599e-14 | -1.086952e-13 | -3.057529e-13 | 6.216524e-16 | 4.679221e-15 | 1.0 | 3.574362e-13 | 3.217976e-12 |
| 7 | 1.092147e-14 | -5.736207e-14 | 7.818612e-14 | 1.05194e-13 | -8.080253e-17 | -6.886128e-15 | 3.574362e-13 | 1.0 | 1.047293e-10 |
| 8 | -2.695476e-13 | 1.974648e-12 | -5.862706e-12 | -3.969464e-12 | 8.379436e-16 | -1.162077e-13 | 3.217976e-12 | 1.047293e-10 | 1.0 |


In [150]:
# arranging the correlation in descending order
c = pd.DataFrame(X_train_transformed).corr().abs()
s = c.unstack()
so = s.sort_values(ascending=False)

In [151]:
# number of transformed variables having correlation more than 0.1
count=0
for i in range(len(so.values)):
    if so.values[i] < 1.0 and so.values[i] >= 0.1:
        count = count + 1
print(count)

0


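Also, we can see that the correlation between the factors is effectively zero (the largest off-diagonal value above is on the order of 1e-10), which is consistent with the assumption that the factors are independent of each other.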