### Problem & Solution
Problem: Gone are the days when you had 5 variables to fit your linear regression. Modern datasets contain many more variables/features to choose from, for example 50 or more features across more than 1 million observations.

Solution: Dimensionality Reduction using Feature Selection and Feature Extraction.

### Feature Importance and Feature Selection

Feature Importance refers to techniques that assign a score to input features based on how useful they are at predicting a target variable.

Feature Selection is the process of automatically or manually selecting the features that contribute most to predicting your target variable.

In short, Feature Importance Scores are used for performing Feature Selection.
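
A minimal sketch of that idea, assuming scikit-learn is available: a random forest supplies the importance scores, and the top-scoring features are kept. The synthetic dataset, the choice of model and the cut-off `k = 5` are illustrative assumptions, not recommendations from the text.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (assumed for illustration): 20 features, only 5 informative.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5,
    n_redundant=0, shuffle=False, random_state=0,
)

# Feature importance: fit a model that exposes one score per input feature.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)
scores = forest.feature_importances_

# Feature selection: keep the k features with the highest scores.
k = 5
top_k = np.sort(np.argsort(scores)[::-1][:k])
X_selected = X[:, top_k]

print("selected feature indices:", top_k)
print("shape before/after:", X.shape, X_selected.shape)
```

The selected columns are still original features, only fewer of them.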

### Feature Extraction

Feature Extraction is a feature reduction process. Unlike feature selection, which ranks the existing attributes according to their significance, feature extraction actually transforms the features.

The key difference between feature selection and extraction is that feature selection keeps a subset of the original features while feature extraction creates brand new ones.

Feature extraction is the name for methods that select and/or combine variables into features, effectively reducing the amount of data that must be processed, while still accurately and completely describing the original dataset.

Feature Extraction and Selection


**Feature selection — Selecting the most relevant attributes.**

**Feature extraction — Combining attributes into a new, reduced set of features.**
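
The difference can be sketched in a few lines, assuming NumPy and scikit-learn. PCA stands in here for feature extraction in general, and the selected columns are picked arbitrarily for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Toy data (assumed for illustration): 200 samples, 6 attributes.
X = rng.normal(size=(200, 6))

# Feature selection: keep a subset of the original columns, unchanged.
# (Columns 0 and 3 are an arbitrary choice for this sketch.)
selected = X[:, [0, 3]]

# Feature extraction: build 2 new features, each a linear combination of all 6 columns.
pca = PCA(n_components=2)
extracted = pca.fit_transform(X)

print(np.array_equal(selected[:, 0], X[:, 0]))  # selected columns are original columns
print(pca.components_.shape)                    # (2, 6): one weight per original attribute
```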

### Curse of Dimensionality

The curse of dimensionality refers to the problems that arise when working with data in higher dimensions and that do not exist in lower dimensions.

As the number of features increases, the model becomes more complex, and the more features there are, the greater the chance of overfitting. A machine learning model trained on a large number of features becomes increasingly dependent on the data it was trained on and, in turn, overfits. The result is poor performance on real data, which defeats the purpose of the model.
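
One such problem, not named in the text above but often used to illustrate the curse, is that distances lose contrast as dimensions are added: the nearest and farthest points from a query end up almost equally far away. The sketch below uses only NumPy; the function name `distance_contrast` and the uniform random data are assumptions made for illustration.

```python
import numpy as np

def distance_contrast(n_dims, n_points=500, seed=0):
    """Relative gap between the farthest and nearest point from a random query."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))   # uniform in the unit hypercube
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# The contrast shrinks as the number of dimensions grows.
for d in (2, 10, 100, 1000):
    print(d, round(distance_contrast(d), 3))
```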


### Dimensionality Reduction
In machine learning, the final classification may depend on too many factors. These factors are known as variables or features.

The higher the number of features, the harder it gets to visualize the training set and then work on it. Sometimes, most of these features are correlated, and hence redundant.

This is where dimensionality reduction algorithms come into play.

**Benefits of performing Dimensionality Reduction**

- Reduces Overfitting: Less redundant data means less opportunity to make decisions based on noise (irrelevant data).
- Improves Model Performance: Less misleading data means our model’s performance improves.
- Reduces Training Time: Less data means that algorithms train faster.
- Utilizes Unlabeled Data: Most feature extraction techniques are unsupervised. You can train your autoencoder or fit your PCA on unlabeled data. This can be helpful if you have a lot of unlabeled data and labeling is time-consuming and expensive.
- Better Visualization: Reducing the dimensions of data to 2D or 3D may allow us to plot and visualize it precisely. You can then observe patterns more clearly.

### Nonlinear Dimensionality Reduction Methods

- Kernel PCA
- t-distributed Stochastic Neighbor Embedding (t-SNE)
- Self-Organizing Map (SOM)


Non-linear transformation methods, also known as manifold learning methods, are used when the data doesn’t lie on a linear subspace. They are based on the manifold hypothesis, which says that in a high-dimensional structure, most relevant information is concentrated in a small number of low-dimensional manifolds.

If a linear subspace is a flat sheet of paper, then a rolled up sheet of paper is a simple example of a nonlinear manifold. Informally, this is called a Swiss roll, a canonical problem in the field of non-linear dimensionality reduction.
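
To see why the rolled-up sheet is hard for linear methods, compare the straight-line distance between two points on neighbouring layers with the distance measured along the sheet. This is a NumPy-only sketch; the spiral parametrisation `(t*cos t, height, t*sin t)` is a common way to write a Swiss roll and is assumed here, as is the helper name `roll_point`.

```python
import numpy as np

def roll_point(t, height=0.0):
    """Point on a Swiss-roll-like spiral sheet: (t*cos t, height, t*sin t)."""
    return np.array([t * np.cos(t), height, t * np.sin(t)])

t_start, t_end = 2 * np.pi, 4 * np.pi   # exactly one full turn apart on the roll
a, b = roll_point(t_start), roll_point(t_end)

# Straight-line (Euclidean) distance cuts through the gap between layers.
euclidean = np.linalg.norm(a - b)

# Distance along the sheet: approximate the spiral by many short segments.
ts = np.linspace(t_start, t_end, 10_000)
path = np.stack([ts * np.cos(ts), np.zeros_like(ts), ts * np.sin(ts)], axis=1)
along_sheet = np.linalg.norm(np.diff(path, axis=0), axis=1).sum()

print(f"Euclidean: {euclidean:.2f}, along the sheet: {along_sheet:.2f}")
```

The two points look close in 3D but are far apart once the sheet is unrolled, and that unrolled distance is the structure non-linear methods try to preserve.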

More methods include:

- Multi-dimensional scaling (MDS): A technique for analyzing the similarity or dissimilarity of data as distances in a geometric space. It projects data to a lower dimension such that data points that are close to each other (in terms of Euclidean distance) in the higher dimension are close in the lower dimension as well.
- Isometric Feature Mapping (Isomap): Projects data to a lower dimension while preserving the geodesic distance (rather than the Euclidean distance, as in MDS). Geodesic distance is the shortest distance between two points measured along a curved surface.
- Locally Linear Embedding (LLE): Recovers global non-linear structure from linear fits. Given enough data, each local patch of the manifold can be written as a linear, weighted sum of its neighbours.
- Hessian Eigenmapping (HLLE): Projects data to a lower dimension while preserving the local neighbourhood, like LLE, but uses the Hessian operator (a mathematical operator you don’t need to worry about right now) to better achieve this result, hence the name.
- Spectral Embedding (Laplacian Eigenmaps): Uses spectral techniques to perform dimensionality reduction by mapping nearby inputs to nearby outputs. It preserves locality rather than local linearity.
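
All five methods are available in `sklearn.manifold`. The sketch below runs each on a small Swiss roll. The sample size, `n_neighbors=10` and the dense eigen-solver are assumptions chosen to keep the example fast and stable, not tuned settings.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import MDS, Isomap, LocallyLinearEmbedding, SpectralEmbedding

# Small Swiss roll so that every method finishes quickly.
X, _ = make_swiss_roll(n_samples=300, random_state=0)

methods = {
    "MDS": MDS(n_components=2, random_state=0),
    "Isomap": Isomap(n_neighbors=10, n_components=2),
    "LLE": LocallyLinearEmbedding(
        n_neighbors=10, n_components=2, eigen_solver="dense", random_state=0
    ),
    # HLLE is LLE with method="hessian"; it needs
    # n_neighbors > n_components * (n_components + 3) / 2.
    "HLLE": LocallyLinearEmbedding(
        n_neighbors=10, n_components=2, method="hessian",
        eigen_solver="dense", random_state=0,
    ),
    "Spectral Embedding": SpectralEmbedding(
        n_components=2, n_neighbors=10, random_state=0
    ),
}

# Every estimator maps the 3D roll to a 2D embedding.
embeddings = {name: est.fit_transform(X) for name, est in methods.items()}
for name, emb in embeddings.items():
    print(name, emb.shape)
```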

***

PCA is a linear method, but kernel PCA is a bit like SVMs: it uses a kernel to capture non-linear structure.

You can choose whichever kernel (a mathematical function such as linear, polynomial, etc.) you want to perform PCA with.

## Kernel PCA

Problem: These classes are linearly inseparable in the input space (in 2 dimensions).

Solution: High-dimensional mapping. We can make the problem linearly separable with a simple mapping (from 2 dimensions to 3 dimensions).
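
As a sketch of such a mapping, take two concentric rings, which no straight line separates in 2D, and add the squared radius as a third coordinate. The ring data and the `lift` function are hypothetical choices for illustration; the original figure may have used a different mapping.

```python
import numpy as np

rng = np.random.default_rng(0)

def ring(radius, n=100):
    """Points on a circle of the given radius, with a little radial noise."""
    angles = rng.uniform(0, 2 * np.pi, n)
    r = radius + rng.normal(scale=0.05, size=n)
    return np.column_stack([r * np.cos(angles), r * np.sin(angles)])

inner, outer = ring(0.3), ring(1.0)   # two classes, not linearly separable in 2D

def lift(points):
    """Hypothetical mapping (x1, x2) -> (x1, x2, x1^2 + x2^2)."""
    return np.column_stack([points, (points ** 2).sum(axis=1)])

inner_3d, outer_3d = lift(inner), lift(outer)

# In 3D, the flat plane z = 0.4 lies between the two classes.
print(inner_3d[:, 2].max() < 0.4 < outer_3d[:, 2].min())
```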


High-dimensional mapping can seriously increase computation time.
Can we get around this problem and still get the benefit of high dimensions?
Yes! Using the kernel trick, which computes inner products in the high-dimensional space directly from the original inputs, without ever constructing the mapping. Kernel PCA extends conventional principal component analysis (PCA) to a high-dimensional feature space using the “kernel trick”.
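
A small numeric check makes the trick concrete. For the degree-2 polynomial kernel on 2D inputs, the explicit feature map `phi` below (a standard textbook choice, assumed here for illustration) gives exactly the same inner product as evaluating the kernel on the original vectors.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2D vector (3D output)."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly_kernel(a, b):
    """Homogeneous polynomial kernel of degree 2: k(a, b) = (a . b)^2."""
    return np.dot(a, b) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, 0.5])

explicit = np.dot(phi(a), phi(b))   # map to 3D first, then take the dot product
implicit = poly_kernel(a, b)        # same value, computed entirely in 2D

print(explicit, implicit)
```

For higher degrees or the RBF kernel the explicit space becomes very large or infinite-dimensional, which is why evaluating the kernel directly matters.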




In [None]:
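from sklearn.decomposition import KernelPCA

# `data` is assumed to be a numeric array of shape (n_samples, n_features).
# scikit-learn names the polynomial kernel 'poly', not 'polynomial'.
transformer = KernelPCA(n_components=6, kernel='poly')
transformed = transformer.fit_transform(data)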
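
The cell above needs an existing `data` array. For a self-contained run, the sketch below compares plain PCA with RBF-kernel PCA on two concentric circles and checks how well a linear classifier does on each projection. The dataset, `gamma=10` and the use of logistic regression as a separability check are illustrative assumptions.

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA
from sklearn.linear_model import LogisticRegression

# Two concentric circles: not linearly separable in the input space.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

projections = {
    "PCA": PCA(n_components=2).fit_transform(X),
    "Kernel PCA (rbf)": KernelPCA(
        n_components=2, kernel="rbf", gamma=10
    ).fit_transform(X),
}

# A linear classifier on each projection shows whether the classes became separable.
accuracy = {}
for name, Z in projections.items():
    accuracy[name] = LogisticRegression().fit(Z, y).score(Z, y)
    print(f"{name}: linear-classifier accuracy = {accuracy[name]:.2f}")
```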

***

## t-distributed Stochastic Neighbor Embedding (t-SNE)

t-Distributed Stochastic Neighbor Embedding (t-SNE) is a non-linear technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets. It is extensively applied in image processing, NLP, genomic data and speech processing.

Until recently, it was widely regarded as the state-of-the-art technique for visualizing high-dimensional data. A newer technique called UMAP is often faster and is now frequently used in its place.

t-SNE reduces multi-dimensional data to 2 or 3 dimensions so that it can be visualized by the human eye. This lightens the data analysis work, because the 2D or 3D view can reveal various patterns in the dataset.

**t-SNE Strengths**

- Works well for non-linear data: It can capture complex relationships between features and place data points that are similar in the high-dimensional space close together in the low-dimensional one.
- Preserves local structure: Points that are close to one another in the high-dimensional dataset tend to be close to one another in the low dimension. Global structure, such as the distances between well-separated clusters, is preserved less reliably.

**t-SNE Weaknesses**

- Not suited to other dimensionality reduction purposes: It is a bad choice for feature selection or feature extraction because it is based on probability distributions, so use it only for visualization.
- Curse of intrinsic dimensionality (sensitive to intrinsic dimension): The intrinsic dimension is the number of variables needed to generate a good approximation of the signal. t-SNE performs badly if the high-dimensional data actually has a high intrinsic dimension.
- Non-convexity of the t-SNE cost function: Several optimization parameters need to be chosen, and different initializations can lead to different results.

In [None]:
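from sklearn.manifold import TSNE

# `stdized_data` is assumed to be the standardized feature matrix.
# Newer scikit-learn releases (1.5+) rename `n_iter` to `max_iter`.
model = TSNE(n_components=2, perplexity=50, n_iter=5000)
tsne = model.fit_transform(stdized_data)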
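
For a run that does not depend on an existing `stdized_data`, the sketch below embeds a small slice of scikit-learn's digits dataset. The dataset, the 300-sample subset and `perplexity=30` are assumptions chosen to keep the example quick, and the iteration count is left at its default.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# First 300 handwritten-digit images (64 pixel features each) keep the run short.
digits = load_digits()
X, labels = digits.data[:300], digits.target[:300]

# Standardize the features, mirroring the `stdized_data` input used above.
X_std = StandardScaler().fit_transform(X)

# perplexity must be smaller than the number of samples; random_state fixes the layout.
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_std)

print(embedding.shape)  # (300, 2): one 2D point per image, ready for a scatter plot
```

Colouring a scatter plot of `embedding` by `labels` is the usual way to inspect the result.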

## Self-Organizing Map (SOM) or Self-Organizing Feature Map (SOFM)
A self-organizing map (SOM) is a type of artificial neural network (ANN) trained with unsupervised learning to produce a low-dimensional (typically two-dimensional), discretized representation of the input space of the training samples, called a map. It is therefore a method for dimensionality reduction.
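
To show the mechanics, here is a minimal NumPy sketch of SOM training. It is not the MiniSom implementation: the function names `train_som` and `winner`, the linear decay schedule and the Gaussian neighbourhood are simplifying assumptions.

```python
import numpy as np

def winner(weights, x):
    """Grid position (row, col) of the node whose weight vector is closest to x."""
    dists = np.linalg.norm(weights - x, axis=-1)
    return np.unravel_index(np.argmin(dists), dists.shape)

def train_som(data, grid=(5, 5), n_iter=2000, lr=0.5, sigma=1.5, seed=0):
    """Train a tiny SOM; returns weights of shape (rows, cols, n_features)."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    weights = rng.random((rows, cols, data.shape[1]))
    # Grid coordinates of every node, used by the neighbourhood function.
    coords = np.stack(
        np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1
    )
    for step in range(n_iter):
        decay = 1.0 - step / n_iter              # linear decay of rate and radius
        x = data[rng.integers(len(data))]        # pick one random training sample
        bmu = np.array(winner(weights, x))       # best matching unit on the grid
        grid_dist2 = ((coords - bmu) ** 2).sum(axis=-1)
        influence = np.exp(-grid_dist2 / (2 * (sigma * decay + 1e-3) ** 2))
        # Pull the BMU and its grid neighbours towards the sample.
        weights += (lr * decay) * influence[..., None] * (x - weights)
    return weights

rng = np.random.default_rng(1)
# Toy data (assumed): two well-separated clusters in 4 dimensions.
data = np.vstack([rng.normal(0.0, 0.1, (50, 4)), rng.normal(1.0, 0.1, (50, 4))])
som = train_som(data)

# Each 4D sample is reduced to a discrete 2D grid coordinate.
print(winner(som, data[0]), winner(som, data[-1]))
```

Samples from the two clusters should land on different grid nodes, which is the discretized low-dimensional map described above.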

In [None]:
!pip install minisom

Tutorial: https://algobeans.com/2017/11/02/self-organizing-map/