# Sources

Srinivasan, Vitthal. (8 Mar 2017) Understanding and Applying Factor Analysis and PCA, Pluralsight. Available at https://app.pluralsight.com/library/courses/understanding-applying-factor-analysis-pca/table-of-contents

# Cutting Through Clutter with Factor Analysis

Factor analysis differs from regression: regression looks to connect the dots, while factor analysis aims to cut through the clutter.

Regression analysis is plagued by multicollinearity, which occurs when the independent variables are highly correlated with each other and so contain the "same information" for analysis purposes.
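
One quick way to spot multicollinearity is to look at the correlation matrix of the independent variables. The sketch below uses synthetic data; the names `x1`, `x2`, `x3` and the noise level are assumptions made for illustration and do not come from the course.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)  # nearly a copy of x1
x3 = rng.normal(size=n)                  # unrelated to x1 and x2

# one column per independent variable
X = np.column_stack((x1, x2, x3))

# pairwise correlations between the columns
corr = np.corrcoef(X, rowvar=False)
print(np.round(corr, 2))
```

An off-diagonal entry close to 1 (here, the `x1`/`x2` pair) signals that two variables carry largely the same information.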

![image.png](attachment:image.png)

FA looks to get under the surface of the feature set.

**Exploratory Factor Analysis:** PCA is used to generate the components. Experts then trace those principal components back to observable factors and use that analysis to drive intuition.



# What & How: Factor Analysis & PCA

PCA is an application of FA. It is a cookie-cutter technique that finds the 'good' factors in a set of data points. The factors PCA produces are uncorrelated with each other, but they may not be intuitive.

![image.png](attachment:image.png)


**Dimensionality:** the number of features (columns) in the data set

# Rule Based Binary Classifier


![image-2.png](attachment:image-2.png)

## Example using Myers-Briggs

![image.png](attachment:image.png)


Human experts identify and extract the factors. The model started out tracking a large number of features. These features were then grouped and distilled using a rule-based extraction technique to create the prevailing five-dimensional model.



# ML Based Binary Classifier

![image.png](attachment:image.png)


# Other Examples of Interest

## Latent Factors in Stock Returns
- market movement
- interest rates
- industry sectors

## Latent Factors in Bond Returns
- Trend
- Tilt
- Convexity

# Intuition Behind PCA

In general, there are as many principal components as there are dimensions in the original data. Principal components can be thought of as the directions along which the data points differ.

![image.png](attachment:image.png)

The components then become new axes and the data can be reoriented along these new axes.
![image-2.png](attachment:image-2.png)

If the variance along a second component is small, we can ignore it and use fewer dimensions to represent the data. This is called shedding the dimension.
![image-3.png](attachment:image-3.png)
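
The idea of shedding a dimension can be sketched with two strongly correlated features. Everything here is an assumption made for illustration: the synthetic data, the names `a`, `b`, `explained`, and `projected`, and the use of `np.linalg.eigh` (NumPy's eigen-solver for symmetric matrices).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# two strongly correlated features
a = rng.normal(size=n)
b = 0.9 * a + rng.normal(scale=0.2, size=n)
data = np.column_stack((a, b))

# standardize each feature to mean 0, standard deviation 1
data_std = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

# eigen-decomposition of the covariance matrix, sorted largest first
cov = np.cov(data_std, rowvar=False)
eig_val, eig_vec = np.linalg.eigh(cov)
order = np.argsort(eig_val)[::-1]
eig_val, eig_vec = eig_val[order], eig_vec[:, order]

# share of the total variance along each component
explained = eig_val / eig_val.sum()
print(explained)

# the second component carries little variance, so keep only the first
projected = data_std @ eig_vec[:, :1]
print(projected.shape)
```

Because almost all of the variance lies along the first component, the single column in `projected` stands in for both original features with little loss.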

As data sets gain more and more features, PCA is becoming increasingly popular.

# PCA in Python

![image.png](attachment:image.png)

In [1]:
import pandas as pd
import numpy as np

In [None]:
# PCA should be done on standardized data (mean 0, standard deviation 1)
eig_val, eig_vec = np.linalg.eig(data_standardized.cov())
 

In [None]:
# each eigenvalue is the variance captured along its principal component (not the variance of an original feature)
eig_val

In [None]:
# the sum of eig_val equals the total variance of the data
# since every feature is standardized to variance 1, that total equals the number of features
sum(eig_val)

In [None]:
# build a list of (eigenvalue, eigenvector) pairs
eig_pairs = [(np.abs(eig_val[i]),eig_vec[:,i]) for i in range(len(eig_val))]

In [None]:
# sort the pairs by eigenvalue, largest first
eig_pairs.sort(key=lambda x:x[0], reverse=True)

In [None]:
# print the eigenvalues in descending order
for pair in eig_pairs:
    print(pair[0])

In [None]:
# access the eigenvector with the largest eigenvalue
eig_pairs[0][1]

In [None]:
# assign the top three eigenvectors to their own variables
eVector1 = eig_pairs[0][1]
eVector2 = eig_pairs[1][1]
eVector3 = eig_pairs[2][1]

In [None]:
# we can calculate a principal component as the dot product of the standardized data and its eigenvector

pca1 = np.dot(stdYVars,eVector1.reshape(-1,1))

In [None]:
# the first reshape turns the eigenvector into a column so the dot product works; the second flattens the result into a single row
pca1 = np.dot(stdYVars,eVector1.reshape(-1,1)).reshape(1,-1)
pca2 = np.dot(stdYVars,eVector2.reshape(-1,1)).reshape(1,-1)
pca3 = np.dot(stdYVars,eVector3.reshape(-1,1)).reshape(1,-1)

In [None]:
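# reshape the data for regression: one row per observation, one column per component
# (np.array(zip(...)) does not work in Python 3, where zip returns an iterator)
xData = np.column_stack((pca1.T, pca2.T, pca3.T))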

In [None]:
yData = standardized_dependentVar
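
The manual steps above can be checked end to end on a small synthetic data set. This is a sketch under stated assumptions: the DataFrame `raw`, its columns, and the hidden factors `f1` and `f2` are made up for illustration, and it uses `np.linalg.eigh` (intended for symmetric matrices such as a covariance matrix) in place of `np.linalg.eig`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 300

# hypothetical data: four features driven by two hidden factors plus noise
f1 = rng.normal(size=n)
f2 = rng.normal(size=n)
raw = pd.DataFrame({
    "a": f1 + 0.1 * rng.normal(size=n),
    "b": f1 + 0.1 * rng.normal(size=n),
    "c": f2 + 0.1 * rng.normal(size=n),
    "d": f1 - f2 + 0.1 * rng.normal(size=n),
})

# standardize to mean 0, standard deviation 1
data_standardized = (raw - raw.mean()) / raw.std()

# eigen-decomposition of the covariance matrix, sorted largest first
eig_val, eig_vec = np.linalg.eigh(data_standardized.cov().to_numpy())
order = np.argsort(eig_val)[::-1]
eig_val, eig_vec = eig_val[order], eig_vec[:, order]

# project the data onto all principal components at once
scores = data_standardized.to_numpy() @ eig_vec

# total variance equals the number of standardized features
print(eig_val.sum())

# the components are uncorrelated: their covariance matrix is diagonal
print(np.round(np.cov(scores, rowvar=False), 3))
```

The diagonal of the printed covariance matrix matches `eig_val`, which is why each eigenvalue can be read as the variance along its component.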

# PCA for Data Viz

# PCA using SciKit Learn

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

In [2]:
# instantiate the scaler and standardize the features
# x = raw feature set, X = standardized feature set
# y = target

X = StandardScaler().fit_transform(x)



In [None]:
pca = PCA(n_components=2)
principleComponents = pca.fit_transform(X)  # fit on the standardized features, not the raw x
pca.explained_variance_

In [None]:
pca.explained_variance_ratio_

In [None]:
import matplotlib.pyplot as plt

PC_values = np.arange(pca.n_components_) + 1
plt.plot(PC_values, pca.explained_variance_ratio_, 'ro-', linewidth=2)
plt.title('Scree Plot')
plt.xlabel('Principal Component')
plt.ylabel('Proportion of Variance Explained')
plt.show()

In [None]:
principleDf = pd.DataFrame(data = principleComponents, columns = ['pc_1', 'pc_2'])

### Variance Ratio

In [None]:
# returns the fraction of the total variance explained by each principal component
pca.explained_variance_ratio_

In [None]:
# keep the smallest number of components that together explain at least 95% of the variance
pca = PCA(.95)
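
As a rough illustration of the variance threshold, the sketch below fits `PCA(n_components=0.95)` on synthetic data and reports how many components were kept. The data (ten features built from three hidden factors) and the names `factors`, `loadings`, and `cumulative` are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 400

# hypothetical data: ten features built from three hidden factors plus noise
factors = rng.normal(size=(n, 3))
loadings = rng.normal(size=(3, 10))
x = factors @ loadings + 0.1 * rng.normal(size=(n, 10))

X = StandardScaler().fit_transform(x)

# a float between 0 and 1 is read as a target share of explained variance
pca = PCA(n_components=0.95)
pca.fit(X)

# running total of the explained variance over the kept components
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(pca.n_components_)
print(cumulative)
```

After fitting, `pca.n_components_` holds the number of components actually kept, and the last entry of `cumulative` is at least 0.95.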

In [None]:
# fit PCA on the training data only, then apply the same transformation to both the training and test data
# this projects your data onto the principal components for further analysis

pca.fit(x_train)
x_train_tf = pca.transform(x_train)
x_test_tf  = pca.transform(x_test)

# you can then apply another ML algorithm to the transformed data
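
A minimal sketch of the train/test workflow, assuming synthetic data and a 75/25 split (both are illustrative choices, not from the course). The scaler and the PCA are fit on the training data only and then reused unchanged on the test data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)

# hypothetical data: six features built from two hidden factors plus noise
factors = rng.normal(size=(200, 2))
x = factors @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(200, 6))

x_train, x_test = train_test_split(x, test_size=0.25, random_state=0)

# learn the scaling and the components from the training data only
scaler = StandardScaler().fit(x_train)
pca = PCA(n_components=0.95).fit(scaler.transform(x_train))

# apply the same fitted transformations to both sets
x_train_tf = pca.transform(scaler.transform(x_train))
x_test_tf = pca.transform(scaler.transform(x_test))

print(x_train_tf.shape, x_test_tf.shape)
```

Fitting on the training data alone keeps information from the test set out of the learned components.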

# Higher Dimensionality Return Trip

We can use the compressed data to get back to an approximation of our original data.

In [None]:
approximation = pca.inverse_transform(lower_dimensional_data)
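
To get a feel for how close the approximation is, the sketch below compresses synthetic five-feature data to two components and measures the reconstruction error. The data and the mean-squared-error check (`mse`) are assumptions added for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)

# hypothetical data: five features built from two hidden factors plus noise
factors = rng.normal(size=(300, 2))
data = factors @ rng.normal(size=(2, 5)) + 0.05 * rng.normal(size=(300, 5))

# compress to two components, then map back to the original five features
pca = PCA(n_components=2).fit(data)
lower_dimensional_data = pca.transform(data)
approximation = pca.inverse_transform(lower_dimensional_data)

# average squared difference between the original and the reconstruction
mse = np.mean((data - approximation) ** 2)
print(lower_dimensional_data.shape, approximation.shape)
print(mse)
```

The reconstruction has the same shape as the original data. The error is small here only because the synthetic data really does have two underlying factors; on real data it grows as more variance is discarded.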