PCA is used in "feature set compression."

PCA specializes in shifting and rotating the coordinate system.

Through translation and rotation, PCA finds a new coordinate system and moves the origin from its old location to the centre of the data.
It aligns the new x axis with the principal axis of variation, the direction along which the data points vary the most, and places the new y axis orthogonal to it, where there is less variation. It also tells you how important each of the two axes is.

**Principal vectors found by PCA:**
1. PC Vector 1: (delta x1) and (delta y1) for the new x axis.
2. PC Vector 2: (delta x2) and (delta y2) for the new y axis.

Each vector is normalized to length 1, where vector length = square root of the sum of squares, i.e. sqrt( (delta x)^2 + (delta y)^2 ) = 1.

The two vectors are orthogonal, i.e. their dot product is zero: ( delta x1 multiply delta x2 ) + ( delta y1 multiply delta y2 ) = 0.
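The unit length and orthogonality of the principal vectors can be checked numerically. This is a minimal sketch on made-up 2-D data (the data and variable names are assumptions for illustration); it takes the eigenvectors of the covariance matrix with NumPy rather than calling sklearn.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-D data: y is roughly a linear function of x, plus noise.
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(scale=0.2, size=200)
data = np.column_stack([x, y])

# Eigen-decomposition of the covariance matrix.
# The columns of `vectors` are the principal vectors.
cov = np.cov(data, rowvar=False)
values, vectors = np.linalg.eigh(cov)

# eigh sorts eigenvalues in ascending order, so the last column is PC vector 1.
pc1 = vectors[:, 1]
pc2 = vectors[:, 0]

# Vector length = square root of the sum of squares.
length_pc1 = np.sqrt(pc1[0] ** 2 + pc1[1] ** 2)
length_pc2 = np.sqrt(pc2[0] ** 2 + pc2[1] ** 2)

# Dot product of the two vectors.
dot = pc1[0] * pc2[0] + pc1[1] * pc2[1]

print(length_pc1, length_pc2, dot)
```

Both lengths come out as 1 and the dot product as 0, up to floating-point rounding.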

PCA finds the centre of data and the principal axis of variation.


PCA also returns an importance value: a spread value for each of these axes.

The spread value is large for the first axis of variation and much smaller for the second. This number happens to be an **eigenvalue**. It comes out of the eigenvalue decomposition that PCA performs. Together the eigenvalues form an importance vector: **how much weight to give each axis when you look at the data.**

So when you run PCA in code, you will find:

**1.** The new origin.

**2.** The principal vectors (the new axes).

**3.** An importance value for each axis, which measures the amount of spread.
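In sklearn these three outputs map onto attributes of a fitted `PCA` object. The sketch below uses synthetic data (an assumption for illustration) and reads the new origin from `mean_`, the vectors from `components_`, and the spread values from `explained_variance_`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# Hypothetical 2-D data centred away from the origin.
x = rng.normal(loc=3.0, size=300)
y = 2.0 * x + rng.normal(scale=0.3, size=300)
data = np.column_stack([x, y])

pca = PCA(n_components=2).fit(data)

new_origin = pca.mean_            # 1. the new origin (centre of the data)
vectors = pca.components_         # 2. the principal vectors, one per row
spread = pca.explained_variance_  # 3. the importance (spread) of each axis

print("new origin:", new_origin)
print("vectors:\n", vectors)
print("spread:", spread)
```

The first spread value is much larger than the second here, because the data lies close to a line.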

>> **PCA and Linear Regression**

1. It's impossible to build a regression line that goes vertically up ((i.e. constant x with varying y)), because you can't express such data as a function y = f(x).

2. Regression treats the variables asymmetrically ((one is the input, the other is the output)).

3. In PCA, all we get is vectors.
So we can build a coordinate system whose x axis points vertically, along the vertically aligned data, with a perpendicular y axis.
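A quick way to see this is to fit PCA to data that is almost constant in x and spread out in y. The data below is made up for illustration; the first principal vector should point (almost) straight up.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)

# Hypothetical "vertical" data: x barely changes, y varies a lot.
x = 1.0 + rng.normal(scale=0.01, size=200)
y = rng.uniform(-5.0, 5.0, size=200)
data = np.column_stack([x, y])

pca = PCA(n_components=2).fit(data)
first_pc = pca.components_[0]
second_pc = pca.components_[1]

# The sign of a principal vector is arbitrary, so compare absolute values.
print("first PC:", first_pc)    # close to (0, +/-1): the vertical direction
print("second PC:", second_pc)  # close to (+/-1, 0): perpendicular to it
```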


When the data looks like a circle, PCA still returns a main axis and a secondary axis.

However, both eigenvalues can be of similar magnitude, and then we don't gain much by running PCA. So the major axis does not always dominate.
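As a small illustration, the sketch below fits PCA to points placed evenly on a unit circle (a constructed example, not data from these notes). Each axis ends up explaining about half of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

# Points spaced evenly around the unit circle.
angles = np.linspace(0.0, 2.0 * np.pi, 360, endpoint=False)
circle = np.column_stack([np.cos(angles), np.sin(angles)])

pca = PCA(n_components=2).fit(circle)
ratio = pca.explained_variance_ratio_

print(ratio)  # both entries come out at about 0.5
```

With no dominant direction, dropping either axis would throw away roughly half of the spread.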

>> **Measurable vs. Latent Features**

*Measurable* : 
1. Sq. footage.
2. No. of rooms.
3. School ranking.
4. Neighbourhood safety.

*Latent* : Variables that can't be measured directly, but that drive the phenomenon behind the scenes.
1. Size.(( Sq. footage & No. of rooms ))
2. Neighbourhood. (( School ranking &  Neighbourhood safety ))

So the 4 Measurable features have been reduced to 2 Latent features.

**How best to condense our 'N' measurable features into 2 latent features so that we don't lose essential info?**

**Which feature selection tool would be most suitable for this?** SelectKBest or SelectPercentile?

**A**: SelectKBest, because you know how many features you want to get out. Here you want 2 from the lot available, so it throws away all except the 2 most powerful.

SelectPercentile is not a good fit here because you don't know in advance exactly how many features you have, so you can't tell which percentile would leave you with 2.
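A minimal sketch of keeping exactly 2 features with `SelectKBest` follows. The synthetic data, the target, and the choice of `f_regression` as the scoring function are all assumptions made for the example.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(3)

# Hypothetical data: 4 measurable features, 200 samples.
X = rng.normal(size=(200, 4))

# Hypothetical target that is driven by columns 0 and 2 only.
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

# Keep the k=2 features with the highest univariate scores.
selector = SelectKBest(score_func=f_regression, k=2).fit(X, y)
kept = selector.get_support(indices=True)
X_small = selector.transform(X)

print("kept columns:", kept)
print("new shape:", X_small.shape)
```

Note that this keeps 2 of the original columns unchanged; it does not build composite features.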

1. Many features are present, but let's hypothesize that a small number of features actually drive the whole phenomenon and its patterns.

2. Try making a composite feature that probes this phenomenon more directly.

These composite features are called **Principal Components**.

They are usually discussed in terms of **dimensionality reduction**: how you can use PCA to bring down the dimensionality of your data, turning a bunch of features into a few.

>> **Determining the Principal Component**:

**Variance** has two meanings:
1. The willingness/flexibility of an algorithm to learn.
2. Technical term in statistics: roughly the spread of a data distribution ((closely related to standard deviation, which is its square root)).

The principal component of a dataset is the direction that has the largest variance in the data, because projecting onto it retains the maximum information from the original data when we go from 2-D to 1-D.

((In regression, you try to make a prediction. In PCA, you try to find the direction of maximum variance.))

> **Maximal Variance and Information Loss:**

Amount of info lost for a point = distance between that point in 2-D and its new spot (its projection) on the line.

The amount of info lost is proportional to that distance for that point.

**Information Loss** = sum of these distances over all points ((strictly, PCA works with the squared distances)).

When we project onto the direction of maximal variance, and only onto that direction, we minimize the information loss.
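The link between maximal variance and minimal information loss can be checked numerically. In this sketch the data and the helper `information_loss` are hypothetical, and the loss is taken as the sum of squared distances, which is the quantity PCA minimizes.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)

# Hypothetical correlated 2-D data.
x = rng.normal(size=300)
y = 0.8 * x + rng.normal(scale=0.3, size=300)
data = np.column_stack([x, y])
centered = data - data.mean(axis=0)


def information_loss(direction):
    """Sum of squared distances from each point to its projection on the line."""
    direction = direction / np.linalg.norm(direction)
    projected = np.outer(centered @ direction, direction)
    return float(np.sum((centered - projected) ** 2))


first_pc = PCA(n_components=1).fit(data).components_[0]

loss_pc = information_loss(first_pc)
loss_x_axis = information_loss(np.array([1.0, 0.0]))
loss_y_axis = information_loss(np.array([0.0, 1.0]))

print("loss along first PC:", loss_pc)
print("loss along old x axis:", loss_x_axis)
print("loss along old y axis:", loss_y_axis)
```

Projecting onto the first principal component gives the smallest loss of the three directions.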

**You can put all available features into PCA together, and it automatically gives you the first principal component, the second principal component, and so on. It is then up to you to interpret the components, i.e. to work out what the main driving phenomenon behind each one is.**

**Q**: What is the maximum number of principal components allowed by sklearn?

**A**: The minimum of the number of training points and the number of features.
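This limit can be observed by fitting `PCA()` without specifying `n_components` and reading the fitted `n_components_` attribute. The two random arrays below are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)

few_rows = rng.normal(size=(5, 10))   # 5 training points, 10 features
many_rows = rng.normal(size=(50, 3))  # 50 training points, 3 features

# With n_components left unset, PCA keeps as many components as it can.
n_few = PCA().fit(few_rows).n_components_
n_many = PCA().fit(many_rows).n_components_

print(n_few, n_many)
```

Here `n_few` is limited by the number of training points and `n_many` by the number of features.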

> **PCA review**:

1. A systematized way to transform input features into principal components.
2. Those principal components are available as new features to use instead of the original input features.
3. PC's are directions in the data that maximize variance (minimize information loss) when you project/compress down onto them.
4. The principal components are also ranked: the more variance the data has along a PC, the higher that PC is ranked.
   The one with the most variance (most info) is the first principal component, and so on.
5. PC's are all perpendicular to each other, so the 2nd PC is guaranteed not to overlap the 1st, the 3rd won't overlap the 2nd, and so on. In that sense you can treat them as independent features.
6. Max number of PC's = number of input features in the dataset (provided there are at least as many training points). Usually you will use only the first handful of PC's. You can use all of them, but you won't gain anything: you are just representing all the features in a different way.
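Two of these points can be illustrated on made-up data: keeping every PC is just a lossless change of coordinates, and the transformed features are uncorrelated with each other. The array `X` and the variable names are assumptions for the sketch.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(6)

# Hypothetical data with 4 correlated features.
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))

# Keep all 4 PC's: transforming and inverting gives the original data back.
pca_all = PCA(n_components=4).fit(X)
scores = pca_all.transform(X)
restored = pca_all.inverse_transform(scores)

# Keep only 2 PC's: the reconstruction is now an approximation.
pca_two = PCA(n_components=2).fit(X)
approx = pca_two.inverse_transform(pca_two.transform(X))

# Covariance of the new features: off-diagonal entries are (numerically) zero.
score_cov = np.cov(scores, rowvar=False)

print("lossless with all PC's:", np.allclose(restored, X))
print("lossless with 2 PC's:", np.allclose(approx, X))
print(np.round(score_cov, 6))
```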


**When to use PCA:**

1. To get access to latent features driving the patterns in the data. In other words, you want to know the size of the first principal component, to try to figure out whether there is a latent feature. ((E.g. can you measure who the big shots at Enron are?))
2. Dimensionality reduction helps to:

    a. Visualize high-dimensional data. ((If you have many features, turn them into PC's and make scatterplots.))

    b. Reduce noise. ((The 1st and 2nd PC's capture most of the structure in the data, and the smaller PC's that follow are mostly noise.))

    c. Make other algorithms ((regression, classification)) work better because they have fewer inputs ((e.g. eigenfaces)), i.e. PCA as pre-processing before you use another algorithm. With high dimensionality, the algorithm can be high variance: it may fit the noise in the data, and it can run slowly.
    
 



> **PCA in sklearn**

In [None]:
from sklearn.decomposition import PCA

# `data` is assumed to be an array of shape (n_samples, n_features)
pca = PCA(n_components=2)
pca.fit(data)

# attributes: eigenvalues and components
pca.explained_variance_        # the eigenvalues: the variance (spread) along each PC
pca.explained_variance_ratio_  # fraction of the total variance captured by the 1st PC, the 2nd PC, and so on

first_pc = pca.components_[0]
second_pc = pca.components_[1]


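#### Facial recognition with PCA in sklearn

PCA applied to facial recognition is also known as **eigenfaces**.

from sklearn.decomposition import PCA

# RandomizedPCA was removed from sklearn; PCA with svd_solver='randomized' replaces it.
# n_components, h, w, X_train and X_test are assumed to be defined already.
pca = PCA(n_components=n_components, svd_solver='randomized', whiten=True).fit(X_train)

eigenfaces = pca.components_.reshape((n_components, h, w))  # PC's of the face data, reshaped into images

X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)


## then create an SVM and train it on X_train_pca ##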


**Observing perpendicularity with PCA:**

The projection step of PCA is easiest to understand when you subtract out the mean shift of the new principal components, so that the new and old dimensions have the same mean.
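To make the role of the mean concrete, the sketch below reproduces sklearn's `transform` by hand on made-up data: subtract the mean found by PCA, then project onto the principal vectors. It assumes the default `whiten=False`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)

# Hypothetical data whose features have clearly non-zero means.
X = rng.normal(loc=[10.0, -4.0, 2.0], size=(50, 3))

pca = PCA(n_components=2).fit(X)

# Manual projection: shift to the centre of the data, then project.
manual = (X - pca.mean_) @ pca.components_.T

print(np.allclose(manual, pca.transform(X)))
print(manual.mean(axis=0))  # approximately zero in every new dimension
```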

> **PCA for facial recognition:**

1. Pictures of faces have high input dimensionality (many pixels).
2. Faces have general patterns that could be captured in a smaller number of dimensions. ((Eyes a bit close together, chin types, etc.))

In a multiclass classification problem like Facial recognition (more than 2 labels to apply), accuracy is a less-intuitive metric than in the 2-class case. Instead, a popular metric is the **F1 score.**

**Q**: As you add more principal components as features for training your classifier, do you expect it to get better or worse performance?

**A**: Ideally, we hope that adding more components will give us more signal information to improve the classifier performance.

**Q**: If you see a higher F1 score, does it mean the classifier is doing better, or worse?

**A**: Better. A higher F1 score means the classifier is doing better!

**Q**: Do you see any evidence of overfitting when using a large number of PCs? Does the dimensionality reduction of PCA seem to be helping your performance here?

**A**: Yes, the F1 score (performance) starts to drop when many PC's are used here ((in the facial recognition problem)).




> **Selecting the number of PC's:**

**Q**: What's a good way to figure out how many PC's to use?

**A**: Train on different numbers of PC's and see how the accuracy responds. Cut off when it becomes apparent that adding more PC's doesn't buy you much more discrimination.
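One possible way to run that experiment is sketched below. It uses sklearn's bundled digits dataset as a stand-in for the face data, and the component counts, the default `SVC`, and 3-fold cross-validation with macro F1 are all assumptions made for the example.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Stand-in dataset: 8x8 digit images flattened to 64 features.
X, y = load_digits(return_X_y=True)

scores = {}
for n in (2, 5, 10, 20, 40):
    # PCA is fitted inside the pipeline, so each fold learns its own PC's.
    model = make_pipeline(PCA(n_components=n), SVC())
    scores[n] = cross_val_score(model, X, y, cv=3, scoring="f1_macro").mean()

for n, score in scores.items():
    print(f"{n:>2} components: macro F1 = {score:.3f}")
```

Reading the printed scores from top to bottom shows where adding more PC's stops paying off.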


If you do feature selection before feeding the features into PCA, you are already throwing out information that PCA might have been able to put to use, so you could lose a lot of info.

It's fine to do feature selection after running PCA.