# Day3 Review


## How to approach this (or analogous) problem(s):

1) __EDA__: 
* Calculate correlation metrics amongst all variables
* Look for potential collinearity amongst explanatory variables
* Examine distributions + summary statistics of all variables + comment on their appearance
* Plot relationships between each potential explanatory variable and the response variable
* Derive inferences
* Identify (and impute) missing values
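
As a rough illustration of the EDA checklist above, the following sketch computes summary statistics, missing-value counts, pairwise correlations, and a list of potentially collinear explanatory pairs. The data is synthetic, and the column names (`engine_size`, `horsepower`, `height`, `price`) and the 0.8 threshold are assumptions chosen for the example, not values taken from the assignment.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (hypothetical column names, not from the assignment).
rng = np.random.default_rng(0)
n = 200
engine_size = rng.normal(130, 40, n)
horsepower = 0.8 * engine_size + rng.normal(0, 5, n)  # nearly collinear with engine_size
height = rng.normal(54, 2, n)                          # unrelated to the response
price = 100 * engine_size + rng.normal(0, 1500, n)    # response variable

df = pd.DataFrame({
    "engine_size": engine_size,
    "horsepower": horsepower,
    "height": height,
    "price": price,
})
df.loc[[3, 17, 42], "height"] = np.nan                 # inject a few missing values

# Summary statistics and missing-value counts.
summary = df.describe()
missing_counts = df.isna().sum()

# Pairwise correlations (pandas excludes NaN values pairwise).
corr = df.corr()

# Correlation of each explanatory variable with the response.
corr_with_response = corr["price"].drop("price")

# Pairs of explanatory variables whose |r| exceeds an assumed threshold.
threshold = 0.8
explanatory = [c for c in df.columns if c != "price"]
collinear_pairs = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(explanatory)
    for b in explanatory[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]

print(missing_counts)
print(corr_with_response)
print(collinear_pairs)
```

Plots (histograms, scatter plots against the response) would normally accompany these numbers; they are omitted here to keep the sketch short.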

2) __Data Preparation__: 
* Check which imputation method will be most effective 
* Check whether the data is approximately normally distributed and on comparable scales, and decide how to handle it if it is not (e.g., normalization or standardization).
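
One possible way to compare imputation methods is to hide some known values, impute them, and measure the error. The sketch below does this for mean and median imputation on a synthetic, right-skewed column (the name `curb_weight` and the 10% masking rate are assumptions), then standardizes the result. Note that standardization only rescales the column; it does not change its shape, so the skewness is the same before and after.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)
# Hypothetical right-skewed variable (log-normal), so mean and median differ.
values = pd.Series(rng.lognormal(mean=3.0, sigma=0.8, size=500), name="curb_weight")

# Hide a random ~10% of the known values so imputation error can be measured.
mask = rng.random(values.size) < 0.10
observed = values.mask(mask)

errors = {}
for name, fill in {"mean": observed.mean(), "median": observed.median()}.items():
    imputed = observed.fillna(fill)
    # Mean absolute error on the entries we deliberately hid.
    errors[name] = float((imputed[mask] - values[mask]).abs().mean())

# Use whichever candidate had the lower error on the hidden entries.
best = min(errors, key=errors.get)
filled = observed.fillna(observed.mean() if best == "mean" else observed.median())

# Standardize (z-score): mean 0, standard deviation 1, same shape as before.
standardized = (filled - filled.mean()) / filled.std()
skewness = float(filled.skew())

print(errors, best)
print(skewness, float(standardized.skew()))
```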

3) __Dimensionality Reduction + Feature Selection__: 
* Based on the results of your EDA, select the explanatory variables you believe are likely to be the most useful within your model(s). 
* Exclude variables that are highly correlated with one another (i.e., "collinear"). 
* Exclude variables that exhibit low variance. 
* Consider excluding variables that exhibit little relation with the response variable. 
* Do all of this __before__ using PCA, recursive feature elimination, VIFs, or p-value analysis. 
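
The screening steps above could be sketched as a small helper. This is only one possible implementation: the function name `filter_features`, both thresholds, and the tie-breaking rule (when two variables are collinear, keep the one more strongly correlated with the response) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def filter_features(df, response, corr_threshold=0.8, min_variance=1e-3):
    """Return explanatory columns that survive a variance and collinearity screen."""
    X = df.drop(columns=[response])
    # 1. Drop near-constant columns (the threshold is scale dependent; an assumption).
    keep = [c for c in X.columns if X[c].var() > min_variance]
    # 2. Rank remaining columns by |correlation| with the response, strongest first.
    strength = df[keep].corrwith(df[response]).abs().sort_values(ascending=False)
    selected = []
    for col in strength.index:
        # Keep a column only if it is not highly correlated with one already kept.
        if all(abs(df[col].corr(df[s])) <= corr_threshold for s in selected):
            selected.append(col)
    return selected

# Synthetic example with hypothetical column names.
rng = np.random.default_rng(1)
n = 300
x1 = rng.normal(size=n)
x3 = rng.normal(size=n)
data = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.1, size=n),  # near-duplicate of x1
    "x3": x3,
    "const": 5.0,                               # zero-variance column
})
data["y"] = 2 * x1 + x3 + rng.normal(scale=0.5, size=n)

selected = filter_features(data, response="y")
print(selected)
```

In this example the constant column is dropped, and only one of the near-duplicate pair `x1`/`x2` is retained alongside `x3`.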

## Potential Models:

- Apply PCA to continuous numeric data; select some number of PC's; use selected PC's as the basis of a regression model (excluding all categorical explanatory variables).

- Apply PCA to continuous numeric data; select some number of PC's; use selected PC's + a subset of categorical variables as the basis of a regression model. Refine model via use of backward and bi-directional selection and VIFs.

- Use correlation thresholds, then recursive feature elimination (p-value analysis), then VIFs to produce a model with a small number of statistically significant and non-collinear variables.
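
The first option in the list above (PCA on continuous data, then regression on the retained PCs) might look roughly like the sketch below. It uses plain NumPy rather than a higher-level library so that each step is visible. The data is synthetic, and the 90% cumulative-variance cutoff is an assumed selection rule, not a recommendation from these notes.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 250
# Hypothetical continuous predictors; two latent factors drive four columns.
f1, f2 = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([
    f1 + 0.3 * rng.normal(size=n),
    f1 + 0.3 * rng.normal(size=n),
    f2 + 0.3 * rng.normal(size=n),
    f2 + 0.3 * rng.normal(size=n),
])
y = 3.0 * f1 - 2.0 * f2 + rng.normal(scale=0.5, size=n)

# Standardize so that no column dominates purely because of its units.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# PCA via SVD of the standardized data.
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
explained_ratio = s**2 / np.sum(s**2)

# Keep the smallest number of PCs reaching an assumed 90% cumulative variance.
k = int(np.searchsorted(np.cumsum(explained_ratio), 0.90) + 1)
scores = Z @ Vt[:k].T                      # PC scores, shape (n, k)

# Ordinary least squares on the retained PCs (with an intercept column).
A = np.column_stack([np.ones(n), scores])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ coef
r_squared = 1 - resid.var() / y.var()

print(k, explained_ratio.round(3), round(float(r_squared), 3))
```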


## Use the results of your EDA as the starting point for all downstream work

When asked to apply dimensionality reduction and/or feature selection methods to a data set, we should rely upon the results of our EDA work, i.e., our starting point should be the use of the correlation metrics and preliminary predictive inferences we've derived from the data.

We should __NOT__ start by simply throwing all of the data we've been given into a backward selection process or PCA: why bother with an EDA if you are simply going to ignore its results? The results of a thorough EDA will typically allow us to construct our models in a much more efficient and effective manner than will simply throwing our hands up and arbitrarily forcing all of our data into a model.

## Avoid the use of Python-based tools that you don't fully understand

While Python (and many other languages) often provide very simple + highly abstracted tools that let us implement very complex concepts in a very small amount of code, we should avoid the use of tools we don't fully understand. Using such highly abstracted tools without sufficient knowledge of their underlying algorithms can easily compromise our work with inaccurate and/or irrelevant output, while also triggering a large number of potentially unnecessary calculations.


## PCA

__Can we apply PCA to categorical features that have been converted to nominal numeric values (e.g., via one-hot encoding or label encoding)?__

The answer is __NO__, we should __NEVER__ apply PCA to __ANY__ type of categorical data, even if the categorical information has been converted to (or was always in) numerical format. __PCA__ is meant to be __applied to continuous variables__, for which it tries to maximize the variance (i.e., the squared deviations) of the data. The concept of squared deviations doesn't really exist when applied to binary or label encoded data. 

By contrast, __categorical data is measured on a nominal or ordinal scale__, meaning that the spacing between categories has no interval/ratio meaning. For example, consider the "symboling" discussed previously: "symboling" is an __ordinal categorical variable__ having possible values of (-3, -2,..,2, 3). __No meaningful mathematical result can be derived from the addition, subtraction, multiplication, or division of a categorical variable's possible data values__ since the nominally numeric values __are not cardinal numbers__. So while we can say that a "symboling" value of -3 is preferable to a "symboling" value of 2, we cannot derive meaning from the addition or subtraction of these encoded values.

So while you can obtain an output from a PCA algorithm based on numeric encodings of categorical inputs, the output is highly unlikely to have any relevant "meaning" (i.e., garbage in ... garbage out).
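
A small experiment can make the "garbage in ... garbage out" point concrete. In the sketch below, the same nominal variable is label encoded in two equally valid ways, and the leading principal component of the data changes with the arbitrary choice of codes. The variable name `body_style`, its categories, and both encodings are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
# Hypothetical nominal variable with three unordered categories.
body_style = rng.choice(["sedan", "wagon", "hatchback"], size=n, p=[0.6, 0.3, 0.1])
x = rng.normal(scale=0.5, size=n)  # continuous variable, independent of body_style

def first_pc(encoding):
    """Leading eigenvalue/eigenvector of the covariance of [x, encoded category]."""
    codes = np.array([encoding[v] for v in body_style], dtype=float)
    cov = np.cov(np.column_stack([x, codes]), rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    return eigvals[-1], eigvecs[:, -1]

# Two equally valid label encodings of the same categories.
val_a, vec_a = first_pc({"sedan": 0, "wagon": 1, "hatchback": 2})
val_b, vec_b = first_pc({"sedan": 2, "wagon": 0, "hatchback": 1})

print(val_a, val_b)  # the "variance explained" depends on the arbitrary codes
```

Nothing about the underlying data differs between the two runs; only the labels were permuted, yet the PCA output is different.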

_That said_, not everyone agrees, and there are alternative approaches designed for categorical or mixed data:
1. [CATPCA](https://www.ibm.com/support/knowledgecenter/en/SSLVMB_23.0.0/spss/categories/idh_cpca.html)
2. [Another approach](https://people.orie.cornell.edu/mru8/doc/udell15_pca_dataframe.pdf)
3. [Yet another approach](https://arxiv.org/pdf/1410.7404.pdf)
4. [And yet another approach](http://papers.nips.cc/paper/2078-a-generalization-of-principal-components-analysis-to-the-exponential-family.pdf) 

__Remove highly correlated features prior to PCA__

Retaining highly correlated features can cause PCA to __over-emphasize__ the contribution of the highly correlated variables within the principal components + potentially change the direction of the associated eigenvectors and/or the magnitude of the associated eigenvalues. Here's a link to a fairly good explanation of this phenomenon: [Should one remove highly correlated variables before doing PCA?](https://stats.stackexchange.com/questions/50537/should-one-remove-highly-correlated-variables-before-doing-pca)

So the answer to the question is __YES__, we should attempt to remove features that appear to be highly correlated with one another prior to applying PCA to a set of continuous numeric data.
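
The over-emphasis effect can be seen in a toy example. Below, two independent variables `a` and `b` each account for roughly half of the total variance, but adding two near-copies of `a` (the hypothetical `a2` and `a3`) makes the first PC absorb most of the variance and load almost entirely on the `a` group. All data here is synthetic.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
a = rng.normal(size=n)
b = rng.normal(size=n)
# Two near-copies of `a` (hypothetical redundant measurements).
a2 = a + 0.1 * rng.normal(size=n)
a3 = a + 0.1 * rng.normal(size=n)

def pca_on_correlation(*cols):
    """Return (explained-variance ratios, loadings) sorted by descending eigenvalue."""
    corr = np.corrcoef(np.column_stack(cols), rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1]
    return eigvals[order] / eigvals.sum(), eigvecs[:, order]

ratio_clean, _ = pca_on_correlation(a, b)
ratio_redundant, loadings = pca_on_correlation(a, a2, a3, b)

print(ratio_clean.round(3))          # roughly an even split between a and b
print(ratio_redundant.round(3))      # first PC dominated by the a, a2, a3 group
print(np.abs(loadings[:, 0]).round(3))
```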


## Variance Inflation Factors

Variance inflation factors (VIFs) are an __OUTPUT__ of __a series of regression models__. To calculate VIFs for a set of $N$ explanatory variables, we need to regress each explanatory variable on all of the other explanatory variables and compute $VIF_j = 1 / (1 - R_j^2)$ from each fit. That's $N$ auxiliary regression models, each with $N-1$ predictors !!!

Therefore, we should __NOT__ be using VIFs __before__ we've attempted to remove highly correlated explanatory variables from a data set. VIFs are appropriately derived __from the output of regression models we have constructed using the knowledge we've gained from our EDA work__. This avoids the use of many arbitrary + unnecessary calculations while also __contextualizing the VIFs relative to a model that has been constructed via a process of informed inquiry__, as opposed to an arbitrary calculation of VIFs prior to the application of the domain knowledge we develop via an EDA process. 
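
To make the regressions behind VIFs explicit, here is a minimal NumPy sketch in which each explanatory variable is regressed on the remaining ones and $VIF_j = 1 / (1 - R_j^2)$ is computed from the fit. The helper name `variance_inflation_factors` and the synthetic columns are assumptions for illustration. Libraries such as statsmodels offer an equivalent routine, but writing it out shows what is actually being calculated.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF for each column of X: regress it on all remaining columns (plus intercept)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r_squared = 1.0 - resid.var() / target.var()
        vifs[j] = 1.0 / (1.0 - r_squared)
    return vifs

rng = np.random.default_rng(5)
n = 400
x1 = rng.normal(size=n)
x2 = x1 + 0.2 * rng.normal(size=n)   # strongly related to x1
x3 = rng.normal(size=n)              # unrelated to the others
X = np.column_stack([x1, x2, x3])

vifs = variance_inflation_factors(X)

# Cross-check: VIFs equal the diagonal of the inverse correlation matrix.
vifs_check = np.diag(np.linalg.inv(np.corrcoef(X, rowvar=False)))

print(vifs.round(2))
print(vifs_check.round(2))
```

Here `x1` and `x2` receive large VIFs because each is well explained by the other, while `x3` stays close to 1. Had the collinear pair been reduced to a single variable during EDA, the remaining VIFs would all be near 1.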