## Dimensionality Reduction

Many data science tasks involve very large datasets with many variables. Ultimately, we want to take all the variables we have collected (the given features) and extract valuable information from them. <br>
But how can we do that if we have too many features to start with? To solve this problem we use **dimensionality reduction** to reduce the dimension of our data, meaning that we use fewer features. <br>
There are two common ways of dimensionality reduction: <br>
1.Feature Elimination. <br>
2.Feature Extraction.<br>
Feature elimination is much easier than feature extraction, but its main problem is that we may **lose data** in the process. Feature extraction is harder, but its main goal is to keep the important variation in our data, which may lead to valuable discoveries, while still reducing the data's dimensionality. <br>
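
To make the difference concrete, here is a minimal sketch on made-up data. The array `data`, the chosen columns, and the `weights` matrix are all hypothetical placeholders for illustration; in particular the weights are arbitrary, whereas a method such as PCA (described in the next section) would derive them from the data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 5 samples, 3 features.
data = rng.normal(size=(5, 3))

# Feature elimination: keep columns 0 and 1, discard column 2 entirely.
eliminated = data[:, [0, 1]]

# Feature extraction: build 2 new features, each a weighted mix of the originals.
# These weights are arbitrary placeholders, not values computed by any algorithm.
weights = np.array([[0.5, 0.0],
                    [0.5, 0.0],
                    [0.0, 1.0]])
extracted = data @ weights

print(eliminated.shape, extracted.shape)
```

Both results have two features, but the eliminated version no longer contains any information from the third column, while the extracted version still draws on all three.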

#### PCA
PCA, which stands for Principal Component Analysis, is a way to reduce the dimensionality of our data using feature extraction. Given $n$ features that describe our data, we can use PCA to create $n$ **new features** that are linearly **uncorrelated** with one another and still describe our data. We can then reduce the dimensionality by keeping only the $k$ most important of the $n$ new features (where $k < n$). <br>

**Note**: Although PCA reduces the dimensionality of our data while keeping the new features uncorrelated, the features we get from this algorithm are **less interpretable** than the original ones.

##### How does PCA work?
Because the algorithm involves a lot of technical math, I will first explain the idea behind it. The first thing the algorithm does is create a matrix that describes the relations between all the features in our data, meaning how each feature varies with the other features. We then use this matrix to compute two quantities: the **directions** along which our data varies and the **magnitude** of that variation in each direction.<br>
For example, in a 2D space we can have the following directions (these graphs are from the [setosa.io blog](http://setosa.io/ev/principal-component-analysis/), check it out):

![title](setosa1.png)


We can see two directions in our data: the red direction and the green direction. If we plot our data relative to both of these directions, we get the following:

![title](setosa2.png)

We can clearly see from the plot above that the data varies far more along the red direction, so it is the more important one for our data. If we drop the green direction, we get 1D data along the red direction.

More generally, for $N$-dimensional data we can determine which directions are more important than others and reduce its dimensionality!
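
The same idea can be tried on synthetic 2D data. This is only a sketch under stated assumptions: the data is made up (a second feature that mostly follows the first), the names `points`, `magnitudes` and `directions` are mine, and the matrix describing the feature relations is taken to be the covariance matrix, which the next section explains.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2D data: the second feature mostly follows the first, plus small noise.
x = rng.normal(size=500)
y = x + 0.2 * rng.normal(size=500)
points = np.column_stack([x, y])

# Center the data, then describe the feature relations with the covariance matrix.
centered = points - points.mean(axis=0)
cov = np.cov(centered, rowvar=False)

# eigh is meant for symmetric matrices; it returns eigenvalues in ascending order,
# so the last column of `directions` is the strongest direction.
magnitudes, directions = np.linalg.eigh(cov)

print("magnitudes:", magnitudes)
print("strongest direction:", directions[:, -1])

# Keeping only the strongest direction turns the 2D data into 1D data.
reduced = centered @ directions[:, -1]
print(reduced.shape)
```

One magnitude should come out far larger than the other, which plays the role of the red direction in the plots above.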

#### Math Math Math



So how can we compute a matrix that describes the relations between each of our features? <br>
Probability theory comes to the rescue!<br>
We can use the [covariance matrix](https://en.wikipedia.org/wiki/Covariance_matrix). For $n$-dimensional data with features $x_1, x_2, \dots, x_n$, the covariance matrix $Cov$ is an $n \times n$ matrix where: $$Cov[i,j] = cov(x_i, x_j)$$ <br>
The [**covariance**](https://en.wikipedia.org/wiki/Covariance) is a measure of the relationship between two variables. If large values of the first variable correspond to large values of the second, the covariance is positive; if the relationship goes the other way (larger values of the first correspond to smaller values of the second), the covariance is negative.<br>
Formal definition of the covariance:
$$ cov(X,Y) = E[(X - E[X])(Y - E[Y])]$$
The covariance is simply the expected value (mean) of the product of the deviations of $X$ and $Y$ from their expected values.
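
As a quick illustration of the definition, the sketch below estimates the covariance on synthetic data by replacing each expected value with a sample mean. The helper name `covariance` and the variables `x`, `y_up` and `y_down` are hypothetical and exist only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y_up = 2 * x + rng.normal(size=1000)     # tends to grow when x grows
y_down = -2 * x + rng.normal(size=1000)  # tends to shrink when x grows

def covariance(a, b):
    """Sample version of E[(A - E[A])(B - E[B])], with means as expectations."""
    return np.mean((a - a.mean()) * (b - b.mean()))

print(covariance(x, y_up))    # positive
print(covariance(x, y_down))  # negative

# np.cov builds the whole covariance matrix at once. rowvar=False treats columns
# as features, and bias=True divides by n so it matches the mean used above.
features = np.column_stack([x, y_up, y_down])
cov_matrix = np.cov(features, rowvar=False, bias=True)
print(cov_matrix.shape)  # (3, 3)
```

Entry `[i, j]` of `cov_matrix` is the covariance of feature `i` with feature `j`, so the matrix is symmetric and its diagonal holds each feature's variance.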

#### The Algorithm

Suppose we have a matrix $X \in \mathbb{R}^{n\times p}$ that holds all of our data, meaning there are $n$ training examples and $p$ features. We will go through the following steps:<br>

1.**Mean Normalization**: Take each column of $X$, which represents one feature in our data, and subtract the mean of that feature from the column. This way each feature in our data will have a mean of 0 (wondering why this is necessary? [read here](https://stats.stackexchange.com/questions/69157/why-do-we-need-to-normalize-data-before-principal-component-analysis-pca?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa))<br>

2.**Standardization**: We need to decide whether the variance of a feature should affect its importance. If it should, we leave the data as it is; otherwise we divide each feature column by its standard deviation. We save the new values in our matrix $X$.<br>

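3.**Compute Covariance Matrix**: Because every column of $X$ now has a mean of 0, we can compute the covariance matrix of $X$ using the computation below. The factor $\frac{1}{n-1}$ only rescales the eigenvalues, so it does not change the eigenvectors or their order: $$Cov[X] = \frac{1}{n-1} {X}^\intercal X $$<br>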


4.**Compute eigenvalues and eigenvectors**: Take the covariance matrix that we computed and compute its [eigenvalues and eigenvectors](https://en.wikipedia.org/wiki/Eigenvalues_and_eigenvectors). Then create a matrix $P$ that holds the eigenvectors in its columns.<br>

5.**Sort by importance**: Take the eigenvalues that we computed, $\lambda_1,\lambda_2,...,\lambda_p$, and sort them from largest to smallest, then put the eigenvectors saved in $P$ in the same order. Notice that the eigenvectors in the columns of $P$ are linearly independent of one another, because the covariance matrix is symmetric and so its eigenvectors can be chosen to be [orthogonal](https://en.wikipedia.org/wiki/Orthogonality) to each other (which basically means they are perpendicular).<br>

6.**Multiply X by P**: Take the normalized and standardized matrix $X$ (from step 2), multiply it by our ordered eigenvector matrix $P$, and save the result in a new matrix $Z$:
$$Z = XP$$
Each column of $Z$ is a new feature: a **linear combination** of our original features whose **weights are the entries of one eigenvector**, with the columns ordered by importance.<br>

7.**Decision Time**: We need to decide how many features to keep and how many to drop. A common way of doing this is to compute the [proportion of variance explained](https://stats.stackexchange.com/questions/22569/pca-and-proportion-of-variance-explained) by each new feature and keep the $k$ most important ones (if we want to reduce our $p$-dimensional data to $k$ dimensions).
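
The decision in step 7 can be sketched as follows. Everything specific here is an assumption made for illustration: the data is synthetic (5 features driven by 2 hidden factors plus noise), the 95% `threshold` is just one common rule of thumb, and the variable names are mine.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 examples, 5 features driven by 2 hidden factors plus noise.
hidden = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = hidden @ mixing + 0.1 * rng.normal(size=(200, 5))

# Steps 1-2: center and standardize each feature.
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Steps 3-5: covariance matrix, eigen-decomposition, sort from largest to smallest.
cov = X.T @ X / (X.shape[0] - 1)
eigenvalues, P = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, P = eigenvalues[order], P[:, order]

# Step 7: each eigenvalue's share of the total variance, and the running total.
explained = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained)

# Assumed rule of thumb: keep the fewest components that reach 95% of the variance.
threshold = 0.95
k = int(np.searchsorted(cumulative, threshold) + 1)

# Step 6, restricted to the first k columns of P.
Z = X @ P[:, :k]
print(explained.round(3), k, Z.shape)
```

The eigenvalue attached to each column of $P$ is the variance of the matching column of $Z$, which is why dividing by the sum of the eigenvalues gives the proportion of variance explained.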

##### Code
Note: Most of the math-heavy parts of the algorithm are already implemented in NumPy, so this will be quite short.

```python
import numpy as np
from numpy.linalg import eigh
```

```python
# Random training data.
X = np.random.randint(100, size=(150, 10))

def pca(data, num_components=2):
    '''
    Perform PCA on a given dataset in order to reduce its dimensionality
    to num_components.
    '''

    # Mean normalization and standardization.
    normalized = (data - np.mean(data, axis=0)) / np.std(data, axis=0)

    # Compute the covariance matrix (the columns already have zero mean).
    cov = normalized.T.dot(normalized) / (normalized.shape[0] - 1)

    # Compute eigenvalues and eigenvectors. eigh is meant for symmetric
    # matrices, so the results are always real.
    values, vectors = eigh(cov)

    # Sort the eigenvectors to match the eigenvalues, from largest to smallest.
    sorting_idxs = np.argsort(values)[::-1]
    eigenvec_matrix = vectors[:, sorting_idxs]

    # Project the normalized data (not the raw data) onto the top components.
    return normalized.dot(eigenvec_matrix[:, :num_components])
```
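
One way to sanity-check an implementation like this is to compare it with a second route to the same result. The sketch below is self-contained and rests on a standard linear algebra fact: the right singular vectors of the standardized data are the eigenvectors of its covariance matrix. The names `via_eig` and `via_svd` are hypothetical, and the random data mirrors the shape used above.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 100, size=(150, 10)).astype(float)

# Standardize each feature, as in the function above.
standardized = (X - X.mean(axis=0)) / X.std(axis=0)

# Route 1: eigenvectors of the covariance matrix, sorted by decreasing eigenvalue.
cov = standardized.T @ standardized / (len(standardized) - 1)
values, vectors = np.linalg.eigh(cov)
order = np.argsort(values)[::-1]
via_eig = standardized @ vectors[:, order[:2]]

# Route 2: right singular vectors of the standardized data (the rows of vt),
# which numpy already returns ordered by decreasing singular value.
_, singular_values, vt = np.linalg.svd(standardized, full_matrices=False)
via_svd = standardized @ vt[:2].T

# Each direction is only defined up to sign, so compare absolute values.
print(np.allclose(np.abs(via_eig), np.abs(via_svd)))
```

If the two routes disagree by more than a sign flip per column, the usual suspects are a missing normalization step or eigenvectors that were not reordered together with their eigenvalues.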

    