# DIMENSIONALITY REDUCTION INTUITION
Remember in Part 3 - Classification, we worked with datasets composed of only two independent variables.  

We did this for two reasons:  

1. We needed two dimensions to better visualize how Machine Learning models work (by plotting the prediction regions and the prediction boundary for each model). 


2. Whatever the original number of independent variables, we can often end up with two independent variables by applying an appropriate Dimensionality Reduction technique.  

There are two types of Dimensionality Reduction techniques:  
- Feature Selection  
- Feature Extraction  

---

## **FEATURE SELECTION**    
Requires domain knowledge.  
We end up with 2 independent variables that are among the original independent variables   
See Regression code books  

### **Reasons To Ignore A Feature**    
It's noisy  
It causes overfitting  
It is strongly related (highly correlated) with a feature that's already present  
Additional features slow down training/testing process  

### **Feature != Information**   
Features and information are two different things.  
Features attempt to access information, but they are not the same as the information itself.  
Think of it as quantity vs quality.  
We want the minimum number of features that gives as much information as possible.  
- Keeping too many features leads to overfitting  
- We want to keep just the most powerful and discriminative features   


### Bias-Variance Dilemma and Number of Features  

**High Bias:**  
- Pays little attention to data; oversimplified  
- Does the same thing over and over again regardless of what the data might be telling it to do  
- High error on the training set: low r^2 or large SSE    
- Typical when too few features are used  

**High Variance:**   
- Pays too much attention to the data (does not generalize well); overfits  
- Just memorizes the training examples, so as soon as it gets a new data point that isn't like them, it doesn't know what to do   
- Much higher error on the test set than on the training set  
- Typical when many features are used and performance is carefully optimized on the training data  
- The classic way to overfit an algorithm is to use a lot of features and not a lot of training data  

**Sweet Spot: want to fit an algorithm with**  
- Few features 
- Large r^2 or low SSE 
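
The trade-off above can be illustrated with a small sketch. Everything here is an assumption made for illustration: the data is synthetic (one informative feature plus pure-noise features), and `make_data`, `fit_ols` and `mse` are hypothetical helper names. It fits ordinary least squares with 1 feature and with 25 features on only 30 training points, then compares training and test error.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n_samples, n_noise_features):
    """One informative feature followed by pure-noise features (synthetic)."""
    informative = rng.normal(size=(n_samples, 1))
    noise = rng.normal(size=(n_samples, n_noise_features))
    y = 3.0 * informative[:, 0] + rng.normal(size=n_samples)
    return np.hstack([informative, noise]), y

def fit_ols(X, y):
    """Least-squares fit with an intercept column."""
    X1 = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef

def mse(coef, X, y):
    """Mean squared error of the fitted coefficients on (X, y)."""
    X1 = np.column_stack([np.ones(len(X)), X])
    return float(np.mean((X1 @ coef - y) ** 2))

X_train, y_train = make_data(30, 24)      # few training points
X_test, y_test = make_data(1000, 24)      # large held-out set

results = {}
for n_features in (1, 25):
    cols = slice(0, n_features)           # first column is the informative one
    coef = fit_ols(X_train[:, cols], y_train)
    results[n_features] = (
        mse(coef, X_train[:, cols], y_train),
        mse(coef, X_test[:, cols], y_test),
    )
    print(n_features, "features -> train MSE %.2f, test MSE %.2f" % results[n_features])
```

With many features and little data, the training error shrinks while the test error grows, which is the high-variance pattern described in the list.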
 
### **Univariate Feature Selection**  
There are several go-to methods of **automatically** selecting your features in sklearn. Many of them fall under the umbrella of **univariate feature selection, which treats each feature independently** and asks how much power it gives you in classifying or regressing.

There are two main univariate feature selection tools in sklearn: SelectPercentile and SelectKBest. The names describe the difference:   
- **SelectPercentile:** selects the X% of features that are most powerful (where X is a parameter)   

      from sklearn.feature_selection import SelectPercentile, f_classif
      selector = SelectPercentile(f_classif, percentile=10)

- **SelectKBest:** selects the K features that are most powerful (where K is a parameter).
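
As a minimal sketch of both selectors side by side, the example below uses synthetic data from `make_classification` (an assumption; any labeled feature matrix would do) and the `f_classif` score function mentioned above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, SelectPercentile, f_classif

# Synthetic data: 20 features, only 3 of which carry class information.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)

# Keep the 3 features with the highest univariate F-score.
k_best = SelectKBest(f_classif, k=3).fit(X, y)
X_k = k_best.transform(X)

# Keep the top 10% of features, scored the same way.
top_pct = SelectPercentile(f_classif, percentile=10).fit(X, y)
X_pct = top_pct.transform(X)

print("SelectKBest kept columns:     ", k_best.get_support(indices=True))
print("SelectPercentile kept columns:", top_pct.get_support(indices=True))
print(X.shape, "->", X_k.shape, "and", X_pct.shape)
```

Because each feature is scored on its own, a feature that is only useful in combination with another one can be missed by this approach.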


A clear candidate for feature reduction is text learning, since text data has very high dimensionality (typically one feature per word).

### **Other Feature Selection Techniques:**
- Backward Elimination
- Forward Selection
- Bidirectional Elimination
- Score Comparison and more   
 
**Greedy Method**
- Build a classifier for each individual feature and pick the best one via cross-validation  
- Build another set of classifiers, all of which contain the first (best) feature and one other feature 
- Pick the best via cross-validation. Now you have two features  
- Repeat  
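
The greedy steps above can be sketched as a short loop. This is an illustration under assumptions, not a reference implementation: the data is synthetic, `forward_select` is a hypothetical helper name, and logistic regression is an arbitrary choice of classifier.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

def forward_select(X, y, n_keep, cv=5):
    """Greedily add the feature that gives the best cross-validated accuracy."""
    selected = []
    remaining = list(range(X.shape[1]))
    while len(selected) < n_keep:
        scores = {}
        for j in remaining:
            cols = selected + [j]                 # features kept so far plus one candidate
            model = LogisticRegression(max_iter=1000)
            scores[j] = cross_val_score(model, X[:, cols], y, cv=cv).mean()
        best = max(scores, key=scores.get)        # candidate with the best CV score
        selected.append(best)
        remaining.remove(best)
        print("added feature %d, CV accuracy %.3f" % (best, scores[best]))
    return selected

chosen = forward_select(X, y, n_keep=3)
```

Recent versions of sklearn also ship `sklearn.feature_selection.SequentialFeatureSelector`, which packages the same idea.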

### Feature Selection (aka Regularization) in Regression 
The power of regularization is that it can do this selection for you automatically.   
 
**Ordinary Multivariate Regression vs Lasso Regression**    

**Regular Linear Regression:**  
Just wants to minimize SSE   
Uses all the features made available to it and assigns each one a regression coefficient 

**Lasso regression:**  
Regularized regression  
Method for automatically penalizing extra features (for features that don't help the regression results enough, it can set the feature's coefficient to zero)  

In addition to minimizing SSE, we also want to minimize the number of features used, so Lasso adds a penalty term for additional features.    
- The gain in precision/goodness of fit has to be bigger than the penalty paid for having that additional feature in the regression.    
- The fit automatically takes the penalty parameter into account. In doing so, it helps you figure out which features have the most important effect on the regression, and it eliminates (sets to zero) the coefficients of features that basically don't help.  

Conceptually, you can think of it as trying features one at a time: if a new feature doesn't improve the fit enough to outweigh the penalty for including it, it isn't added (its coefficient stays at zero).
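
For reference, the penalized objective can be written out as below. The symbols are assumptions introduced here: $\beta_j$ are the regression coefficients, $x_{ij}$ is feature $j$ of sample $i$, and $\lambda \ge 0$ is the penalty parameter. Note that the penalty acts on the absolute size of the coefficients rather than literally counting features; it is this form that tends to push unhelpful coefficients exactly to zero.

```latex
\min_{\beta_0,\,\beta} \;
\underbrace{\sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^{2}}_{\text{SSE}}
\;+\;
\lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
```

Setting $\lambda = 0$ recovers ordinary least squares; larger values of $\lambda$ zero out more coefficients.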

**Lasso in sklearn**  

    # Fit
    from sklearn.linear_model import Lasso
    features, labels = GetMyData()      # placeholder: load your own features and labels
    regression = Lasso()
    regression.fit(features, labels)    # supervised learning, so fit with features and labels

    # Predict
    regression.predict([[2, 4]])        # pass a 2D array: one row per sample to predict

    # Inspect coefficients
    print(regression.coef_)             # to see which features have large coefficients
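
To see the difference between the two regressions, here is a small runnable sketch on synthetic data (an assumption: six features, of which only the first two influence the target). The value `alpha=0.3` is an arbitrary penalty strength chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
# Only the first two columns influence the target; the other four are noise.
y = 4.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

ols = LinearRegression().fit(X, y)      # uses every feature
lasso = Lasso(alpha=0.3).fit(X, y)      # penalizes coefficient size

print("OLS coefficients:  ", np.round(ols.coef_, 3))
print("Lasso coefficients:", np.round(lasso.coef_, 3))
print("features kept by Lasso:", np.flatnonzero(lasso.coef_))
```

Ordinary regression assigns a small nonzero coefficient to every noise column, while Lasso sets them to zero and shrinks the remaining coefficients somewhat toward zero.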


## **FEATURE EXTRACTION**  

## **1.	PRINCIPAL COMPONENT ANALYSIS (PCA)**   

### Overview  
PCA is not a full machine learning algorithm, but instead an unsupervised learning algorithm. It is a transformation of the data that attempts to identify which directions (combinations of features) explain the most variance in the data.

From the m independent variables of your dataset, PCA extracts p <= m **NEW** independent variables that explain the most variance of the dataset, regardless of the dependent variable.    

End up with 2 independent variables that are **NEW** (as opposed to feature selection where we end up with 2 independent variables that are among the original independent variables).     

The fact that DV (dependent variable) is not considered makes PCA an unsupervised model.  

Apply PCA after data pre-processing and scaling, but **before** you fit the model.   

### **PCA is a Powerful Algorithm**
- For dimensionality reduction. You can use PCA to bring down the dimensionality of your data, turning a bunch of features into just a few.  
- Also powerful as a stand alone method in its own right for unsupervised learning  


### Review/Definition of PCA  
1. Systematized way to transform input features into their principal components   

2. Use those Principal Components as new features (the principal components are available to you to use instead of the original input features)

3. Principal Components are directions in data that maximize variance (minimize information loss) when you project/compress down onto them.   

4. The more variance of the data along a PC, the higher that PC is ranked 
    - Most variance/most information = First PC
    - Second most variance (without overlapping with first PC) = Second PC    

5. Max Number of PC's = number of input features you had in your data set
    - Usually only use the first handful of Principal Components  
    - Could go all the way out and use the max number, but in that case, you're not really gaining anything. Just representing your features in a different way.   

### When to use PCA  
**Latent features driving the patterns in data (big shots @ Enron)**  
- if you want access to latent features you think might be showing up in the data
- or trying to figure out if there is a latent feature (size of first PC)  

**Dimensionality Reduction**  
- **Helps you visualize high-dimensional data**  
When you draw a scatter plot, you only have two dimensions available, but often you'll have more than two features. So there's a struggle over how to represent 3 or 4 numbers about a data point if you only have two dimensions in which to draw it. What you can do is project the data down to the first 2 PC's and plot that. Things like K-means become a lot easier to visualize. You still capture most of the information in the data, but now you can draw it in two dimensions.  
    
    
- **Reduces noise (almost all data will have noise).**  
The first and second PC's capture the actual patterns in the data, and the smaller PC's often just represent noisy variations about those patterns. By throwing away the less important PC's, you get rid of that noise.  
    - Makes other algorithms (regression, classification) work better because they have fewer inputs (eigenfaces)  
    - Use PCA as pre-processing before you use another algorithm   
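
Both uses can be sketched in a few lines. The iris dataset bundled with sklearn is used here purely as an assumed stand-in for "data with more than two features".

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                          # 150 samples, 4 features

pca = PCA(n_components=2).fit(X)

# Visualization: two numbers per sample, usable as x and y of a scatter plot.
X_2d = pca.transform(X)
print(X.shape, "->", X_2d.shape)

# Noise reduction: map back to the original 4 features using only the
# first two components, dropping the variation along the smaller ones.
X_denoised = pca.inverse_transform(X_2d)
print("mean squared change per value: %.4f" % np.mean((X - X_denoised) ** 2))
```

`X_2d` could also be passed to a downstream model in place of the original features, which is the pre-processing use mentioned above.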

### **Pros:**
All outputs are uncorrelated (no redundancy)  
Outputs are sorted by information contained (measured by variance)  
We choose enough components such that we retain 95% or 99% of the original variance (this is feature selection)    
Automatic: doesn’t require domain knowledge      
Get a good idea of what it is so you can use it every time you want to visualize data  
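
The first three points can be checked directly. This sketch assumes the digits dataset bundled with sklearn as example data; passing a float between 0 and 1 as `n_components` (with `svd_solver="full"`) asks sklearn to keep just enough components to exceed that fraction of explained variance.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data                        # 1797 samples, 64 pixel features

# Keep enough components to retain at least 95% of the variance.
pca = PCA(n_components=0.95, svd_solver="full")
Z = pca.fit_transform(X)

print("components kept:", pca.n_components_, "of", X.shape[1])
print("variance retained: %.3f" % pca.explained_variance_ratio_.sum())

# Outputs are sorted by variance, largest first.
sorted_by_variance = bool(np.all(np.diff(pca.explained_variance_ratio_) <= 0))
print("sorted by variance:", sorted_by_variance)

# Outputs are uncorrelated: off-diagonal covariances are numerically zero.
cov = np.cov(Z, rowvar=False)
off_diag = cov - np.diag(np.diag(cov))
print("largest off-diagonal covariance: %.2e" % np.abs(off_diag).max())
```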

### **Cons:**
It’s only a linear transformation  

### **PCA for Data Transformation**  
If you're given data of any shape, PCA finds a new coordinate system that's obtained from the old one by translation and rotation only. It...
- Moves the center of the coordinate system to the center of the data  
- Moves the x axis onto the principal axis of variation (where you see the most variation relative to all the data points) 
- Moves the other axis so it is orthogonal (at a right angle) to the first, along the less important direction of variation  

PCA finds these axes for you and also tells you how important each axis is (importance value/spread value)  
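
The translation and rotation can be written out by hand with NumPy. This is a minimal sketch on a synthetic 2-D cloud (an assumption), using the eigenvectors of the covariance matrix as the new axes and the eigenvalues as the spread along each axis.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 2-D cloud: stretched along one direction, rotated by 30 degrees,
# and shifted away from the origin.
base = rng.normal(size=(500, 2)) * np.array([3.0, 0.5])
angle = np.deg2rad(30)
rotation = np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle),  np.cos(angle)]])
data = base @ rotation.T + np.array([10.0, -4.0])

# 1. Translate: move the origin to the center of the data.
center = data.mean(axis=0)
centered = data - center

# 2. Rotate: eigenvectors of the covariance matrix are the new axes,
#    eigenvalues are the spread (variance) along each axis.
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(centered, rowvar=False))
order = np.argsort(eigenvalues)[::-1]         # largest variance first
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

new_coords = centered @ eigenvectors          # data in the new coordinate system

print("center:                ", center)
print("principal axis:        ", eigenvectors[:, 0])
print("second axis:           ", eigenvectors[:, 1])
print("spread along each axis:", eigenvalues)
```

The sign of each axis is arbitrary, so the principal axis may come out pointing in either direction along the same line.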

### **Measurable vs Latent Features**  
 
Business Problem: Given the features of a home, what is its price?  

**Measurable Variables**  
sq footage, numbers of rooms, school ranking, neighborhood safety  

**Latent Variable**   
Variables that you can't measure directly, but are driving the phenomenon you're measuring behind the scenes.  
Even though you're measuring all these things (sq footage, # of rooms, school ranking, neighborhood safety), you're only really probing two things: size of home and neighborhood.

What is the best way to condense our 4 features into 2 so that we really get to the heart of the information (probing the size and neighborhood)? 

If you have many features but hypothesize that a smaller number of features is actually driving the patterns, try making a composite feature (called a principal component) that more directly probes the underlying phenomenon.  


### How PCA works  
In the simplest case, PCA projects a two-dimensional feature space onto one dimension.      
Data points cast shadows onto a line (the Principal Component), and then that line is turned sideways so it's 1 dimensional.  

The line being projected onto is the Principal Component.  
The data points are projected down onto that Principal Component.  

**Housing Prices Example**  
Combining number of rooms and sq footage into a size feature.   
Combining safety problems and school ranking into one feature that roughly gauges the quality of the neighborhood.    


### **How to Determine the Principal Component**  
Variance in ML is the willingness/flexibility of an algorithm to learn  
Variance is also a technical term in statistics: roughly the spread of the data distribution (the square of the standard deviation). 

The Principal Component of a dataset is the direction that has the largest variance in the data  
This can look similar to a regression, but PCA is trying to find the direction of maximum variance (not make a prediction!)  
- Use this direction because it retains the largest amount of 'information' in the original data   
- We project along the direction of the largest variance because this retains the max amount of info in the original data.  
    - The amount of information lost is the distance from each data point in the 2D space to its new projected spot on the one-dimensional line (Principal Component). 
    - Projecting onto the direction of maximal variance minimizes the distance between the old (higher dimensional) data and its new transformed values, thereby minimizing information loss.  
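
The link between "largest variance" and "smallest information loss" can be checked numerically. The sketch below assumes synthetic, centered 2-D data and a hypothetical helper `project_onto`; it tries candidate directions one degree apart and records, for each, the variance kept along the line and the mean squared distance from the points to their projections.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic, centered 2-D data with most of its spread along one diagonal.
x = rng.normal(size=400)
data = np.column_stack([x, 0.6 * x + 0.3 * rng.normal(size=400)])
data -= data.mean(axis=0)

def project_onto(direction, points):
    """Variance kept and mean squared distance lost when projecting onto a line."""
    u = direction / np.linalg.norm(direction)
    coords = points @ u                       # 1-D coordinate along the line
    shadows = np.outer(coords, u)             # projected points, back in 2-D
    kept = coords.var()
    lost = np.mean(np.sum((points - shadows) ** 2, axis=1))
    return kept, lost

angles = np.deg2rad(np.arange(0, 180))        # candidate directions
stats = np.array([project_onto(np.array([np.cos(a), np.sin(a)]), data)
                  for a in angles])

best_by_variance = int(np.argmax(stats[:, 0]))
best_by_loss = int(np.argmin(stats[:, 1]))
print("max variance at", best_by_variance, "deg; min loss at", best_by_loss, "deg")
```

For centered data, variance kept plus distance lost is the same total for every direction, so maximizing one is the same as minimizing the other.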
    
### Selecting a Number of Principal Components  
Train on different number of PC's and see how accuracy responds - cut off when it becomes apparent that adding more PC's doesn't buy you much more discrimination.  
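
A minimal sketch of that procedure, assuming the digits dataset bundled with sklearn and logistic regression as the downstream classifier (both arbitrary choices for illustration):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)           # 64 input features

accuracy_by_k = {}
for k in (2, 5, 10, 20, 40):
    # Scale, reduce to k principal components, then classify.
    model = make_pipeline(StandardScaler(), PCA(n_components=k),
                          LogisticRegression(max_iter=2000))
    accuracy_by_k[k] = cross_val_score(model, X, y, cv=5).mean()
    print("%2d components -> CV accuracy %.3f" % (k, accuracy_by_k[k]))
```

Putting PCA inside the pipeline means it is refit on each training fold, so the held-out fold never influences the components.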

### PCA >  Feature Selection
You **DO NOT** want to perform feature selection before you go into PCA. PCA is going to find a way to combine information from potentially many different features. So if you throw out information before you do PCA, you're throwing out information that PCA might have been able to rescue. You can do feature selection on the PC's **AFTER** you make them.   

PCA can be computationally expensive, so an exception might be if you have a large input feature space and you know a lot of the features are completely irrelevant. In that case go ahead and toss them out, but **proceed with caution**.  

### PCA as a General Algorithm for Feature Transformation    
PCA is a powerful unsupervised learning technique that you can use to fundamentally understand the latent features in your data. 
If you knew nothing about housing prices, PCA could give you insight that there are two things that seem to drive house prices in general.    

Put all the features into the PCA together and it can automatically 
- Combine them into new features and 
- Rank the relative power of those features.  
    - First PC: most effect; harder to make interpretations because it could be a mixture that contains bits and pieces from potentially all of the features    
    - Second PC... and so on 


### PCA in sklearn  

    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA

    # data: array with one row per sample and two columns
    # (here: bonus and long-term incentive)
    def doPCA(data):
        pca = PCA(n_components=2)
        pca.fit(data)
        return pca

    pca = doPCA(data)
    print(pca.explained_variance_ratio_)
    first_pc = pca.components_[0]
    second_pc = pca.components_[1]

    transformed_data = pca.transform(data)
    for ii, jj in zip(transformed_data, data):
        # red: position along the first PC, cyan: along the second PC, blue: original point
        plt.scatter(first_pc[0]*ii[0], first_pc[1]*ii[0], color='r')
        plt.scatter(second_pc[0]*ii[1], second_pc[1]*ii[1], color='c')
        plt.scatter(jj[0], jj[1], color='b')

    plt.xlabel('bonus')
    plt.ylabel('long-term incentive')
    plt.show()


---


## **2.	LINEAR DISCRIMINANT ANALYSIS (LDA)**  
a.	From the n independent variables of the dataset, LDA extracts p <= n new independent variables that best separate the classes of the dependent variable  
b.	Because it considers the dependent variable, LDA is a supervised model    
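
A minimal sketch with sklearn's `LinearDiscriminantAnalysis`, assuming the bundled iris dataset as example data. One practical detail: in sklearn the number of components is capped at min(number of features, number of classes - 1), so with 3 classes at most 2 components are available.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)             # 4 features, 3 classes

# Unlike PCA, LDA needs the labels y to find its directions.
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)

print(X.shape, "->", X_lda.shape)
print("share of between-class variance per component:",
      lda.explained_variance_ratio_)
```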


---

## **3.	KERNEL PCA**  
a.	Non-linear feature extraction model  
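
As a rough sketch of why a non-linear method can help, the example below uses sklearn's `KernelPCA` on two concentric rings from `make_circles`. The data, the RBF kernel and `gamma=10` are all assumptions chosen for illustration. It compares how far apart the two ring means end up along the first component under linear PCA and under kernel PCA.

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Two concentric rings: no straight line separates them in the original space.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

linear = PCA(n_components=2).fit_transform(X)
kernel = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)

# Distance between the two ring means along the first component.
for name, Z in (("linear PCA", linear), ("kernel PCA", kernel)):
    gap = abs(Z[y == 0, 0].mean() - Z[y == 1, 0].mean())
    print("%s: gap between ring means on first component = %.3f" % (name, gap))
```

The kernel and its parameters usually need tuning for the data at hand; the values here are not a recommendation.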


---

## **4.	QUADRATIC DISCRIMINANT ANALYSIS (QDA)**  

Dimensionality reduction attempts to distill higher-dimensional data down to a smaller number of dimensions, while preserving as much of the variance in the data as possible.    
- It selects new dimensions that preserve the variance in the data as well as it can.    
- Dimensions: features    
- Explained variance ratio: how much of the variance in the original data is preserved after reducing it down to two dimensions
- In this example, PCA has chosen the two remaining dimensions well enough that the first dimension alone captures 92% of the variance in the data. The second dimension adds another 5%, so altogether we lose less than 3% of the variance by projecting down to two dimensions.
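
The explained variance ratio can be read straight off a fitted PCA. The notes do not say which dataset produced the 92% and 5% figures; the sketch below assumes the iris dataset bundled with sklearn, which gives numbers of a similar size.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                          # 4 features per sample
pca = PCA(n_components=2).fit(X)

ratio = pca.explained_variance_ratio_         # one entry per kept dimension
print("variance explained per dimension:", ratio)
print("variance lost by keeping two dimensions: %.3f" % (1 - ratio.sum()))
```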