# Compressing Data via Dimensionality Reduction

An alternative approach to feature selection for dimensionality reduction is **feature extraction**. *Data compression* is an important topic in machine learning, and it helps us to store and analyze the increasing amounts of data that are produced and collected in the modern age of technology.

## Unsupervised dimensionality reduction via principal component analysis

Similar to feature selection, we can use different feature extraction techniques to reduce the number of features in a dataset. The difference between the two is that feature selection algorithms, such as sequential backward selection, maintain the original features, whereas feature extraction transforms or projects the data onto a new feature space. In the context of dimensionality reduction, feature extraction can be understood as an approach to data compression with the goal of maintaining most of the relevant information.

Popular applications of principal component analysis (PCA) include exploratory data analysis and the de-noising of signals in stock market trading, as well as the analysis of genome data and gene expression levels in the field of bioinformatics.

### Principal component analysis

PCA aims to find the directions of maximum variance in high-dimensional data and projects it onto a new subspace with equal or fewer dimensions than the original one. The orthogonal axes (principal components) of the new subspace can be interpreted as the directions of maximum variance given the constraint that the new feature axes are orthogonal to each other, as illustrated in the following figure:

<img src="images/pca_orthogonal.jpeg" alt="PCA orthogonal axes" title="PCA orthogonal axes" height="300" width="450">

If we use PCA for dimensionality reduction, we construct a $d \times k$-dimensional transformation matrix $W$ that allows us to map a sample vector $x$ onto a new $k$-dimensional feature subspace that has fewer dimensions than the original $d$-dimensional feature space:

\begin{equation*}
\begin{matrix}
x = \left[ x_1, x_2, \dots{}, x_d \right], & x \in \mathbb{R}^d \\
\downarrow xW, & W \in \mathbb{R}^{d \times k} \\
z = \left[ z_1, z_2, \dots, z_k \right], & z \in \mathbb{R}^k
\end{matrix}
\end{equation*}

As a result of transforming the original $d$-dimensional data onto this new $k$-dimensional subspace (typically $k \ll d$), the first principal component will have the largest possible variance, and each subsequent principal component will have the largest variance possible given the constraint that it is uncorrelated with (orthogonal to) the preceding principal components. Even if the input features are correlated, the resulting principal components will be mutually orthogonal (uncorrelated). Note that the PCA directions are highly sensitive to data scaling, so we need to standardize the features prior to PCA if they were measured on different scales and we want to assign equal importance to all features.

 Let's summarize the approach in a few simple steps:
 
 1. Standardize the $d$-dimensional dataset.
 2. Construct the covariance matrix.
 3. Decompose the covariance matrix into its eigenvectors and eigenvalues.
 4. Sort the eigenvalues by decreasing order to rank the corresponding eigenvectors.
 5. Select $k$ eigenvectors which correspond to the $k$ largest eigenvalues, where $k$ is the dimensionality of the new feature subspace ($k \leq d$). 
 6. Construct a projection matrix $W$ from the "top" $k$ eigenvectors.
 7. Transform the $d$-dimensional input dataset $X$ using the projection matrix $W$ to obtain the new $k$-dimensional feature subspace.
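
The seven steps above can be sketched in a few lines of NumPy. This is only a minimal illustration on synthetic data: the function name `pca_project` and the array `X_demo` are made up for this example, and `np.linalg.eigh` is used because the covariance matrix is symmetric.

```python
import numpy as np

def pca_project(X, k):
    """Project X (n_samples x d) onto its top-k principal components."""
    # 1. Standardize each feature to zero mean and unit variance
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Construct the d x d covariance matrix (np.cov divides by n - 1 by
    #    default; this rescales the eigenvalues but not the eigenvectors)
    cov = np.cov(X_std, rowvar=False)
    # 3. Decompose it; eigh returns the eigenvalues in ascending order
    eigvals, eigvecs = np.linalg.eigh(cov)
    # 4. Rank the eigenvectors by decreasing eigenvalue
    order = np.argsort(eigvals)[::-1]
    # 5./6. Keep the top-k eigenvectors as the columns of W (d x k)
    W = eigvecs[:, order[:k]]
    # 7. Transform the data onto the k-dimensional subspace
    return X_std @ W, W

# Synthetic data (assumption for illustration): 100 samples, 5 features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
X_demo[:, 1] += 2.0 * X_demo[:, 0]  # make two features correlated

Z, W = pca_project(X_demo, k=2)
print(Z.shape, W.shape)  # (100, 2) (5, 2)
```

The columns of `W` are orthonormal, and the two columns of `Z` are uncorrelated even though the input features were not.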

### Extracting the principal components step by step

In [None]:
# load the wine dataset

In [None]:
# separate training and test data (70, 30)

# standardize the features
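
One possible way to fill in the two cells above is sketched below. It assumes the copy of the Wine dataset bundled with scikit-learn (`load_wine`) rather than a downloaded file, and the `random_state=0` and `stratify=y` arguments are assumptions chosen to make the split reproducible and class-balanced.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the Wine dataset: 178 samples, 13 features, 3 classes
X, y = load_wine(return_X_y=True)

# Separate training and test data (70 percent / 30 percent)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Standardize the features: fit the scaler on the training data only,
# then apply the same transformation to the test data
sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)

print(X_train_std.shape, X_test_std.shape)  # (124, 13) (54, 13)
```

Fitting the scaler on the training set only keeps information from the test set out of the preprocessing step.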

After completing the mandatory preprocessing by executing the preceding code, let's advance to the second step: constructing the covariance matrix. The symmetric $d \times d$-dimensional covariance matrix, where $d$ is the number of dimensions in the dataset, stores the pairwise covariances between the different features. For example, the covariance between two features $x_j$ and $x_k$ on the population level can be calculated via the following equation:

\begin{equation*}
\sigma_{jk} = \frac{1}{n} \sum_{i=1}^n \left( x_j^{(i)} - \mu_j \right) \left( x_k^{(i)} - \mu_k \right)
\end{equation*}

Here, $\mu_j$ and $\mu_k$ are the sample means of features $j$ and $k$, respectively. Note that the sample means are zero if we standardized the dataset. A positive covariance between two features indicates that the features increase or decrease together, whereas a negative covariance indicates that the features vary in opposite directions. For example, the covariance matrix of three features can then be written as follows (note that $\Sigma$ stands for the Greek uppercase letter sigma, which is not to be confused with the *sum* symbol):

\begin{equation*}
\Sigma = \begin{bmatrix}
\sigma_1^2 & \sigma_{12} & \sigma_{13} \\
\sigma_{21} & \sigma_2^2 & \sigma_{23} \\
\sigma_{31} & \sigma_{32} & \sigma_3^2
\end{bmatrix}
\end{equation*}
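
To make the covariance formula concrete, the following sketch evaluates it for three synthetic features and compares the result with NumPy's `np.cov`. The data are random numbers invented for this example. Note that `np.cov` divides by $n - 1$ unless `bias=True` is passed, in which case it uses the same $1/n$ normalization as the equation above.

```python
import numpy as np

# Synthetic data (assumption for illustration): 50 samples, 3 features
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
X[:, 2] -= 2.0 * X[:, 0]  # make feature 3 vary opposite to feature 1

n = X.shape[0]
mu = X.mean(axis=0)

# sigma_jk = (1/n) * sum_i (x_j^(i) - mu_j) * (x_k^(i) - mu_k),
# computed for all pairs (j, k) at once
Sigma = (X - mu).T @ (X - mu) / n

print(Sigma.shape)  # (3, 3)
print(np.allclose(Sigma, Sigma.T))  # the covariance matrix is symmetric
print(np.allclose(Sigma, np.cov(X, rowvar=False, bias=True)))
```

The diagonal entries of `Sigma` are the feature variances $\sigma_j^2$, and the entry `Sigma[0, 2]` is negative here because the third feature was constructed to move against the first.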

The eigenvectors of the covariance matrix represent the principal components (the directions of maximum variance), whereas the corresponding eigenvalues will define their magnitude. In the case of the Wine dataset, we would obtain 13 eigenvectors and eigenvalues from the $13 \times 13$-dimensional covariance matrix.

For the third step, let's obtain the eigenpairs of the covariance matrix. As we remember from our introductory linear algebra classes, an eigenvector $v$ satisfies the following condition:

\begin{equation*}
\Sigma{}v = \lambda{}v
\end{equation*}

Here, $\lambda$ is a scalar: the eigenvalue.

In [None]:
# compute eigenvalues
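
A possible implementation of this cell is sketched below. It repeats the preprocessing so that it runs on its own, again assuming scikit-learn's bundled `load_wine` data and a `random_state=0` split. `np.linalg.eigh` is used here because the covariance matrix is symmetric; `np.linalg.eig` would also work.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Preprocessing (assumed setup): load, split 70/30, standardize
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)
X_train_std = StandardScaler().fit_transform(X_train)

# Step 2: the 13 x 13 covariance matrix of the standardized training data
cov_mat = np.cov(X_train_std.T)

# Step 3: eigendecomposition; column eigen_vecs[:, i] belongs to eigen_vals[i]
eigen_vals, eigen_vecs = np.linalg.eigh(cov_mat)

print(eigen_vals.shape, eigen_vecs.shape)  # (13,) (13, 13)

# Check the condition Sigma v = lambda v for all eigenpairs at once
print(np.allclose(cov_mat @ eigen_vecs, eigen_vecs * eigen_vals))
```

Unlike `np.linalg.eig`, `eigh` always returns real eigenvalues, sorted in ascending order.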

### Total and explained variance

Since we want to reduce the dimensionality of our dataset by compressing it onto a new feature subspace, we only select the subset of the eigenvectors (principal components) that contains most of the information (variance). The eigenvalues define the magnitude of the eigenvectors, so we have to sort the eigenvalues by decreasing magnitude; we are interested in the top $k$ eigenvectors based on the values of their corresponding eigenvalues. But before we collect those $k$ most informative eigenvectors, let us plot the **variance explained** ratios of the eigenvalues. The variance explained ratio of an eigenvalue $\lambda_j$ is simply the ratio of that eigenvalue to the total sum of the eigenvalues:

\begin{equation*}
\frac{\lambda_j}{\sum_{i=1}^d \lambda_i}
\end{equation*}



In [None]:
# with cumsum we can calculate the cumulative sum of explained variances
# show the individual and cumulative explained variance
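
As a sketch of what this cell might compute, the following code derives the individual and cumulative explained variance ratios from the eigenvalues. It prints the values instead of plotting them. The preprocessing is repeated under the same assumptions as before (`load_wine`, `random_state=0`).

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Preprocessing (assumed setup): load, split 70/30, standardize
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)
X_train_std = StandardScaler().fit_transform(X_train)

eigen_vals, eigen_vecs = np.linalg.eigh(np.cov(X_train_std.T))

# Individual ratios lambda_j / sum(lambda), sorted from largest to smallest
tot = eigen_vals.sum()
var_exp = np.sort(eigen_vals)[::-1] / tot

# With cumsum we can calculate the cumulative sum of explained variances
cum_var_exp = np.cumsum(var_exp)

for i, (v, c) in enumerate(zip(var_exp, cum_var_exp), start=1):
    print(f"PC {i:2d}: individual = {v:.3f}, cumulative = {c:.3f}")
```

To visualize the result, `var_exp` could be drawn as a bar chart and `cum_var_exp` as a step function, for example with matplotlib's `plt.bar` and `plt.step`.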

Although the explained variance plot reminds us of the feature importance values that we computed via random forests, we should remind ourselves that PCA is an unsupervised method, which means that information about the class labels is ignored. Whereas a random forest uses the class membership information to compute the node impurities, variance measures the spread of values along a feature axis.

### Feature transformation

Now let's proceed with the last three steps to transform the Wine dataset onto the new principal component axes. The remaining steps we are going to tackle in this section are the following ones:

* Select $k$ eigenvectors, which correspond to the $k$ largest eigenvalues, where $k$ is the dimensionality of the new feature subspace ($k \leq d$). 
* Construct a projection matrix $W$ from the "top" $k$ eigenvectors.
* Transform the $d$-dimensional input dataset $X$ using the projection matrix $W$ to obtain the new $k$-dimensional feature subspace.

Or, in less technical terms, we will sort the eigenpairs by descending order of the eigenvalues, construct a projection matrix from the selected eigenvectors, and use the projection matrix to transform the data onto the lower-dimensional subspace.

In [None]:
# Make a list of (eigenvalue, eigenvector) tuples

# eigen_pairs

# sort the (eigenvalue, eigenvector) tuples from high to low
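
The cell above could be filled in roughly as follows. The variable name `eigen_pairs` comes from the comment in the cell; the preprocessing is repeated under the same assumptions as before so that the sketch runs on its own.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Preprocessing (assumed setup): load, split 70/30, standardize
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)
X_train_std = StandardScaler().fit_transform(X_train)

eigen_vals, eigen_vecs = np.linalg.eigh(np.cov(X_train_std.T))

# Make a list of (eigenvalue, eigenvector) tuples;
# the i-th eigenvector is the i-th COLUMN of eigen_vecs
eigen_pairs = [(np.abs(eigen_vals[i]), eigen_vecs[:, i])
               for i in range(len(eigen_vals))]

# Sort the (eigenvalue, eigenvector) tuples from high to low,
# comparing on the eigenvalue only
eigen_pairs.sort(key=lambda pair: pair[0], reverse=True)

print([round(float(val), 3) for val, _ in eigen_pairs])
```

Sorting with an explicit `key` avoids comparing the eigenvector arrays themselves, which would raise an error if two eigenvalues were equal.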


Next, we collect the two eigenvectors that correspond to the two largest eigenvalues, to capture about 60 percent of the variance in this dataset. In practice, the number of principal components has to be determined by a trade-off between computational efficiency and the performance of the classifier.

In [None]:
# create a 13 x 2-dimensional projection matrix W from the top two eigenvectors.
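
A minimal sketch of this step stacks the two leading eigenvectors as columns. The lowercase name `w` is an assumption, and the preprocessing follows the same assumed setup as before.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Preprocessing (assumed setup): load, split 70/30, standardize
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)
X_train_std = StandardScaler().fit_transform(X_train)

# Eigenpairs of the covariance matrix, sorted from high to low
eigen_vals, eigen_vecs = np.linalg.eigh(np.cov(X_train_std.T))
eigen_pairs = [(np.abs(eigen_vals[i]), eigen_vecs[:, i])
               for i in range(len(eigen_vals))]
eigen_pairs.sort(key=lambda pair: pair[0], reverse=True)

# Stack the top two eigenvectors as columns of the projection matrix W
w = np.hstack((eigen_pairs[0][1][:, np.newaxis],
               eigen_pairs[1][1][:, np.newaxis]))

print(w.shape)  # (13, 2)
```

Depending on the NumPy and LAPACK versions, the signs of the eigenvectors may be flipped. This is not a problem: if $v$ is an eigenvector, then $-v$ is an eigenvector of the same matrix as well.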


Using the projection matrix, we can now transform a sample $x$ (represented as a $1 \times 13$-dimensional row vector) onto the PCA subspace (the principal components one and two) obtaining $x^{\prime}$, now a two-dimensional sample vector consisting of two new features:

\begin{equation*}
x^{\prime} = xW
\end{equation*}

In [None]:
# print x_train_std[0]
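
As an illustration of $x^{\prime} = xW$, the sketch below projects the first standardized training sample onto the two principal components. The setup is the same assumed one as before, and the projection matrix is built here with `np.argsort` for brevity.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Preprocessing (assumed setup): load, split 70/30, standardize
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)
X_train_std = StandardScaler().fit_transform(X_train)

# 13 x 2 projection matrix from the two leading eigenvectors
eigen_vals, eigen_vecs = np.linalg.eigh(np.cov(X_train_std.T))
order = np.argsort(eigen_vals)[::-1]
w = eigen_vecs[:, order[:2]]

# x' = xW: a 13-dimensional sample becomes a 2-dimensional one
x = X_train_std[0]
x_pca = x.dot(w)

print(x.shape, x_pca.shape)  # (13,) (2,)
print(x_pca)
```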

Similarly, we can transform the entire $124 \times 13$-dimensional training dataset onto the two principal components by calculating the matrix dot product:

In [None]:
# calculate the matrix dot product
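
The same projection applied to the whole training set might look like the following sketch. The name `X_train_pca` is an assumption, and the setup repeats the assumed preprocessing from before.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Preprocessing (assumed setup): load, split 70/30, standardize
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)
X_train_std = StandardScaler().fit_transform(X_train)

# 13 x 2 projection matrix from the two leading eigenvectors
eigen_vals, eigen_vecs = np.linalg.eigh(np.cov(X_train_std.T))
order = np.argsort(eigen_vals)[::-1]
w = eigen_vecs[:, order[:2]]

# (124 x 13) . (13 x 2) -> (124 x 2)
X_train_pca = X_train_std.dot(w)

print(X_train_pca.shape)  # (124, 2)

# The variance along each new axis equals the corresponding eigenvalue
print(np.allclose(X_train_pca.var(axis=0, ddof=1), eigen_vals[order[:2]]))
```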

In [None]:
# let's visualize the transformed Wine training set.
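
One way to draw this plot is sketched below with matplotlib, which is assumed to be installed. The colors, markers, and axis labels are arbitrary choices, and the class labels are used only to color the points, not to compute the projection.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Preprocessing (assumed setup): load, split 70/30, standardize
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)
X_train_std = StandardScaler().fit_transform(X_train)

# Project the training data onto the two leading principal components
eigen_vals, eigen_vecs = np.linalg.eigh(np.cov(X_train_std.T))
order = np.argsort(eigen_vals)[::-1]
X_train_pca = X_train_std.dot(eigen_vecs[:, order[:2]])

# One color/marker combination per class label
colors = ['r', 'b', 'g']
markers = ['s', 'x', 'o']
for label, color, marker in zip(np.unique(y_train), colors, markers):
    mask = (y_train == label)
    plt.scatter(X_train_pca[mask, 0], X_train_pca[mask, 1],
                color=color, marker=marker, label=f"class {label}")

plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.legend(loc='lower left')
plt.tight_layout()
plt.show()
```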

As we can see in the resulting plot, the data is more spread out along the first principal component (x-axis) than along the second principal component (y-axis), which is consistent with the explained variance ratio plot that we created in the previous subsection. We can also intuitively see that a linear classifier will likely be able to separate the classes well.

**We have to keep in mind that PCA is an unsupervised technique that doesn't use any class label information.**

### Principal component analysis in scikit-learn

Now, we will discuss how to use the PCA class implemented in scikit-learn. 

In [None]:
# using PCA from scikit-learn
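
A sketch of how the `PCA` class from `sklearn.decomposition` could be used here follows. It fits on the standardized training data, applies the same transformation to the test data, and compares the result with the manual eigenvector projection. The data setup is the same assumed one as before.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Preprocessing (assumed setup): load, split 70/30, standardize
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)
sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)

# Fit PCA on the training data only, then reuse it for the test data
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_std)
X_test_pca = pca.transform(X_test_std)
print(X_train_pca.shape, X_test_pca.shape)  # (124, 2) (54, 2)

# Manual projection for comparison; the axes may differ in sign only
eigen_vals, eigen_vecs = np.linalg.eigh(np.cov(X_train_std.T))
order = np.argsort(eigen_vals)[::-1]
X_manual = X_train_std.dot(eigen_vecs[:, order[:2]])
print(np.allclose(np.abs(X_manual), np.abs(X_train_pca), atol=1e-6))
```

A projection that appears mirrored relative to the manual one is therefore expected and does not indicate an error.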


In [None]:
# test the transformed test dataset
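
The comment in this cell leaves open how the transformed test set should be checked. One plausible reading, sketched here as an assumption, is to train a linear classifier on the two PCA features of the training set and evaluate it on the transformed test set. The use of `LogisticRegression` with default settings is a choice made for this example.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Preprocessing (assumed setup): load, split 70/30, standardize
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)
sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)

# Reduce both sets to two principal components (fit on training data only)
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_std)
X_test_pca = pca.transform(X_test_std)

# Train a linear classifier in the 2-dimensional PCA subspace
lr = LogisticRegression()
lr.fit(X_train_pca, y_train)

# Accuracy on the transformed test set
test_acc = lr.score(X_test_pca, y_test)
print(f"Test accuracy: {test_acc:.3f}")
```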


If we are interested in the explained variance ratios of the different principal components, we can simply initialize the PCA class with the n_components parameter set to None, so all principal components are kept and the explained variance ratio can then be accessed via the explained_variance_ratio_ attribute:

In [None]:
# show PCA
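
A sketch of this cell, under the same assumed data setup, keeps all components and reads the ratios from `explained_variance_ratio_`:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Preprocessing (assumed setup): load, split 70/30, standardize
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)
X_train_std = StandardScaler().fit_transform(X_train)

# n_components=None keeps all 13 principal components
pca = PCA(n_components=None)
X_train_pca = pca.fit_transform(X_train_std)

# Explained variance ratios, sorted from largest to smallest
print(np.round(pca.explained_variance_ratio_, 3))
print(pca.explained_variance_ratio_.sum())
```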

Note that we set n_components=None when we initialized the PCA class so that it will return all principal components in a sorted order instead of performing a dimensionality reduction.