# Compressing Data via Dimensionality Reduction
This notebook will explore two fundamental techniques that help summarize the information content of a dataset by transforming it into a new feature subspace with lower dimensionality than the original. Specifically, we will focus on **Principal Component Analysis (PCA)** and **Linear Discriminant Analysis (LDA)** for linear dimensionality reduction, and **t-Distributed Stochastic Neighbor Embedding (t-SNE)** for nonlinear dimensionality reduction.

# Topics Covered
1. **Introduction to Dimensionality Reduction**

2. **Principal Component Analysis (PCA)**
   - Explanation of how PCA works.
   - Extracting the principal components step by step.
   - Application of PCA on a sample dataset.
   - Visualization of results in the reduced feature space.

3. **Linear Discriminant Analysis (LDA)**
   - Understanding LDA and its connection to classification tasks.
   - Application of LDA on a sample dataset.
   - Visualization of LDA-transformed data.

4. **t-Distributed Stochastic Neighbor Embedding (t-SNE)**
   - Overview of t-SNE for nonlinear dimensionality reduction.
   - Application of t-SNE for visualizing high-dimensional data.
   - Discussion of t-SNE’s strengths and limitations.


### 1. Introduction to Dimensionality Reduction

#### Overview of the need for dimensionality reduction

In modern datasets, it is common to encounter high-dimensional data with numerous features or variables. While more features can provide richer information, high dimensionality often leads to several challenges, such as:

1. **Curse of Dimensionality**: As the number of dimensions increases, the volume of the feature space grows exponentially, making the data sparser. This sparsity can degrade the performance of machine learning models, as it becomes harder to find meaningful patterns and relationships between features.

<div style="text-align: center;">
    <img src="../images/Curse_of_Dimensionality_Chart.png" alt="Curse of Dimensionality" />
</div>


2. **Increased Computational Costs**: High-dimensional data requires more memory and computational power for processing, training, and evaluation of machine learning models. This can lead to slower runtimes and higher resource consumption.

3. **Overfitting**: When a model has too many features relative to the number of observations, it risks overfitting, capturing noise in the data rather than the underlying patterns. Dimensionality reduction helps mitigate overfitting by focusing on the most informative features.

4. **Visualization and Interpretation**: It is difficult to visualize and interpret data in high dimensions. Dimensionality reduction techniques enable the projection of data into 2D or 3D spaces, making it easier to understand and analyze the data visually.

By reducing the number of dimensions, we aim to maintain the most important information while simplifying the dataset, leading to faster computations, improved model performance, and more interpretable results.
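
The sparsity effect behind the curse of dimensionality can be illustrated with a small experiment. The sketch below is not part of the original analysis: it assumes points drawn uniformly from the unit hypercube and uses a hypothetical helper, `mean_nearest_neighbor_distance`, to show how far apart a fixed number of points end up as the number of dimensions grows.

```python
import numpy as np

def mean_nearest_neighbor_distance(n_points, n_dims, rng):
    """Average distance from each point to its nearest neighbor
    for points drawn uniformly from the unit hypercube."""
    points = rng.random((n_points, n_dims))
    # Pairwise squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq_norms = (points ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * points @ points.T
    np.fill_diagonal(sq_dists, np.inf)  # ignore each point's distance to itself
    nearest = np.maximum(sq_dists.min(axis=1), 0)  # clip tiny negative round-off
    return np.sqrt(nearest).mean()

rng = np.random.default_rng(0)
dims = [1, 2, 10, 100]
nn_dists = [mean_nearest_neighbor_distance(200, d, rng) for d in dims]

for d, dist in zip(dims, nn_dists):
    print(f"d={d:>3}: mean nearest-neighbor distance = {dist:.3f}")
```

With the number of points held fixed, the nearest neighbor of each point moves farther away as dimensions are added, which is one way to see why the same amount of data covers a high-dimensional space much more thinly.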

### 2. Principal Component Analysis (PCA)
#### How does PCA work?
**Principal component analysis (PCA)** is an unsupervised linear transformation technique that is widely used across different fields, most prominently for feature extraction and dimensionality reduction. Other popular applications of PCA include exploratory data analysis, the denoising of signals in stock market trading, and the analysis of genome data and gene expression levels in the field of bioinformatics.

- **PCA** helps us identify patterns in data based on the **correlation** between features.
- It aims to find the directions of **maximum variance** in high-dimensional data and projects the data onto a new subspace with equal or fewer dimensions than the original one.

<div style="text-align: center;">
    <img src="../images/PCA_01.png" alt="Using PCA to find the directions of maximum variance in a dataset" height="500" />
</div>

In the figure above, **x1** and **x2** are the original feature axes, and **PC1** and **PC2** are the principal components.


If we use PCA for dimensionality reduction, we construct a $d \times k$ transformation matrix, $\mathbf{W}$, that allows us to map the feature vector of a training example, $\mathbf{x}$, onto a new $k$-dimensional feature subspace that has fewer dimensions than the original $d$-dimensional feature space. The process is as follows. Suppose we have a feature vector, $\mathbf{x}$:
$$\mathbf{x} = \left[ x_{1}, x_{2}, x_{3}, \ldots, x_{d} \right], \quad \mathbf{x} \in \mathbb{R}^{d}$$

which is then transformed by a transformation matrix, $\mathbf{W} \in \mathbb{R}^{d \times k}$:
$$\mathbf{x}\mathbf{W} = \mathbf{z}$$
resulting in the output vector:
$$\mathbf{z} = \left[ z_{1}, z_{2}, z_{3}, \ldots, z_{k} \right], \quad \mathbf{z} \in \mathbb{R}^{k}$$

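As a result of transforming the original $d$-dimensional data onto this new $k$-dimensional subspace (typically $k \ll d$), the first principal component will have the largest possible variance. Each subsequent principal component has the largest variance possible under the constraint that it is uncorrelated with (orthogonal to) the preceding components.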
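
The mapping $\mathbf{x}\mathbf{W} = \mathbf{z}$ is an ordinary matrix product, which a few lines of NumPy make concrete. This is a minimal sketch with assumed values: `d = 5`, `k = 2`, and a placeholder matrix `W_toy` with orthonormal columns built from a QR decomposition. In actual PCA, the columns of the transformation matrix would be eigenvectors of the covariance matrix rather than random directions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 5, 2  # assumed original and reduced dimensionality

# A single d-dimensional feature vector (synthetic, for illustration only)
x_toy = rng.normal(size=d)

# Placeholder d x k matrix with orthonormal columns.
# In PCA these columns would be the top-k eigenvectors instead.
W_toy, _ = np.linalg.qr(rng.normal(size=(d, k)))

# Project the d-dimensional vector onto the k-dimensional subspace
z_toy = x_toy @ W_toy
print(x_toy.shape, W_toy.shape, z_toy.shape)

# The same product projects a whole dataset of n examples at once
X_batch = rng.normal(size=(10, d))
Z_batch = X_batch @ W_toy
print(Z_batch.shape)
```

The row-vector convention used here matches the equation above: each example is a row, so a dataset of shape `(n, d)` multiplied by a `(d, k)` matrix yields an `(n, k)` result.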


Let's summarize PCA in a few simple steps:
1. Standardize the d-dimensional dataset.
2. Construct the covariance matrix.
3. Decompose the covariance matrix into its eigenvalues and eigenvectors.
4. Sort the eigenvalues in decreasing order to rank the corresponding eigenvectors.
5. Select k eigenvectors, which correspond to the k largest eigenvalues, where k is the dimensionality of the new feature subspace ($k \leq d$).
6. Construct a projection matrix, **W**, from the “top” k eigenvectors.
7. Transform the d-dimensional input dataset, **X**, using the projection matrix, **W**, to obtain the new k-dimensional feature subspace.
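
The seven steps above can be sketched compactly in NumPy. This is an illustrative sketch, not the notebook's own implementation: the function name `pca_project` and the synthetic dataset `X_toy` are assumptions made for the example, and `np.linalg.eigh` is used because the covariance matrix is symmetric.

```python
import numpy as np

def pca_project(X, k):
    """Project X (n examples x d features) onto its top-k principal components."""
    # Step 1: standardize each feature to zero mean and unit variance
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: construct the d x d covariance matrix
    cov = np.cov(X_std, rowvar=False)
    # Step 3: decompose it (eigh is intended for symmetric matrices)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Step 4: sort eigenvalues, and their eigenvectors, in decreasing order
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Steps 5-6: the top-k eigenvectors form the d x k projection matrix W
    W = eigvecs[:, :k]
    # Step 7: transform the data into the k-dimensional subspace
    return X_std @ W, eigvals

# Synthetic data: 4 correlated features generated from 2 latent factors
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
X_toy = latent @ rng.normal(size=(2, 4)) + 0.1 * rng.normal(size=(200, 4))

Z_toy, eigvals_toy = pca_project(X_toy, k=2)
print(Z_toy.shape)
print(eigvals_toy / eigvals_toy.sum())  # share of variance per component
```

Because the synthetic features are driven by only two latent factors, the first two eigenvalues should account for most of the total variance here.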

#### Extracting the principal components step by step

In [8]:
import pandas as pd

# Load the UCI Wine dataset; the file has no header row
df_wine = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data",
                      header=None)

# Name the columns: the class label followed by the 13 features
df_wine.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash',
                   'Alcalinity of ash', 'Magnesium', 'Total phenols',
                   'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
                   'Color intensity', 'Hue',
                   'OD280/OD315 of diluted wines', 'Proline']

# Preview the first five rows
df_wine.head()

| | Class label | Alcohol | Malic acid | Ash | Alcalinity of ash | Magnesium | Total phenols | Flavanoids | Nonflavanoid phenols | Proanthocyanins | Color intensity | Hue | OD280/OD315 of diluted wines | Proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065 |
| 1 | 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050 |
| 2 | 1 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185 |
| 3 | 1 | 14.37 | 1.95 | 2.5 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480 |
| 4 | 1 | 13.24 | 2.59 | 2.87 | 21.0 | 118 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735 |


In [9]:
from sklearn.model_selection import train_test_split

# Features are all columns after the first; the class label is the first column
X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values

# Split into 70% training and 30% test data, stratified by class label
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)


Standardize the data

In [10]:
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training data only, then apply the same
# transformation to the test data
sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)
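
To make the standardization step explicit, the sketch below reproduces it by hand in NumPy on synthetic data. The arrays `toy_train` and `toy_test` are hypothetical stand-ins for the wine features, and the use of the population standard deviation (`ddof=0`) is assumed to mirror the default behavior of `StandardScaler`. The key point is that the mean and standard deviation are estimated on the training data only and then reused for the test data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: three features on very different scales
scales = np.array([1.0, 10.0, 1000.0])
toy_train = rng.normal(size=(70, 3)) * scales
toy_test = rng.normal(size=(30, 3)) * scales

# Estimate the parameters on the training data only
mu = toy_train.mean(axis=0)
sigma = toy_train.std(axis=0)  # population standard deviation (ddof=0)

# Apply the same parameters to both sets
toy_train_std = (toy_train - mu) / sigma
toy_test_std = (toy_test - mu) / sigma

print(np.allclose(toy_train_std.mean(axis=0), 0.0))
print(np.allclose(toy_train_std.std(axis=0), 1.0))
print(toy_test_std.mean(axis=0).round(2))
```

The training features end up with zero mean and unit variance by construction, while the test features are only approximately standardized because they were scaled with the training statistics.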

**Calculate the covariance matrix**:
The covariance matrix is a symmetric $d \times d$ matrix, where $d$ is the number of features, that stores the pairwise covariances between the features. The sample covariance between two features, $x_{j}$ and $x_{k}$, can be calculated via the following equation:

$$\sigma_{jk}=\frac{1}{n-1}\sum_{i=1}^{n}\left(x_{j}^{(i)}-\mu_{j}\right)\left(x_{k}^{(i)}-\mu_{k}\right)$$
Here, $\mu_{j}$ and $\mu_{k}$ are the sample means of features $j$ and $k$. Note that these means are zero if the dataset has been standardized.
For example, the covariance matrix of three features can then be written as follows:

$$\Sigma=\begin{bmatrix}\sigma_{1}^{2}&\sigma_{12}&\sigma_{13}\\\sigma_{21}&\sigma_{2}^{2}&\sigma_{23}\\\sigma_{31}&\sigma_{32}&\sigma_{3}^{2}\end{bmatrix}$$
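
As a quick check of the covariance formula, the following sketch computes the matrix by hand and compares it with `np.cov`, which uses the same $\frac{1}{n-1}$ normalization by default. The three-feature dataset `X_cov_toy` is synthetic and only serves as an illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic standardized dataset with n=50 examples and 3 features
X_cov_toy = rng.normal(size=(50, 3))
X_cov_toy = (X_cov_toy - X_cov_toy.mean(axis=0)) / X_cov_toy.std(axis=0)

n = X_cov_toy.shape[0]
mu = X_cov_toy.mean(axis=0)  # approximately zero after standardization
centered = X_cov_toy - mu

# sigma_jk = 1/(n-1) * sum_i (x_j^(i) - mu_j) * (x_k^(i) - mu_k)
cov_manual = centered.T @ centered / (n - 1)

# np.cov expects one variable per row by default, hence the transpose
cov_numpy = np.cov(X_cov_toy.T)

print(np.allclose(cov_manual, cov_numpy))
print(np.allclose(cov_manual, cov_manual.T))  # the matrix is symmetric
```

Note that the diagonal entries are $\frac{n}{n-1}$ rather than exactly 1 in this sketch, because the features were standardized with the population standard deviation while the covariance uses the $\frac{1}{n-1}$ normalization.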


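The eigenvectors of the covariance matrix represent the principal components (the directions of maximum variance), whereas the corresponding eigenvalues define their magnitude. An eigenvector, $\boldsymbol{\vartheta}$, satisfies the following condition:
$$\Sigma\boldsymbol{\vartheta}=\lambda\boldsymbol{\vartheta}$$
Here, $\lambda$ is a scalar: the eigenvalue.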
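
The eigenvector condition can be verified numerically. The sketch below is an illustration on synthetic data, not the notebook's own cell: it decomposes a small covariance matrix with `np.linalg.eigh`, which is suited to symmetric matrices and returns eigenvalues in ascending order, and then checks that every eigenpair satisfies the equation above. The names `cov_eig_toy`, `eig_vals_toy`, and `eig_vecs_toy` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Covariance matrix of a synthetic dataset with 3 features
X_eig_toy = rng.normal(size=(100, 3))
cov_eig_toy = np.cov(X_eig_toy, rowvar=False)

# eigh returns eigenvalues in ascending order and eigenvectors as columns
eig_vals_toy, eig_vecs_toy = np.linalg.eigh(cov_eig_toy)

# Check Sigma v = lambda v for every eigenpair
checks = [
    np.allclose(cov_eig_toy @ v, lam * v)
    for lam, v in zip(eig_vals_toy, eig_vecs_toy.T)
]
print(checks)

# The eigenvectors are orthonormal, so V^T V is the identity matrix
print(np.allclose(eig_vecs_toy.T @ eig_vecs_toy, np.eye(3)))
```

Since a covariance matrix is symmetric and positive semidefinite, its eigenvalues are real and non-negative, and its eigenvectors can be chosen to be mutually orthogonal.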