# Feature Extraction
* Feature extraction is one family of dimensionality reduction techniques.
* Feature extraction reduces the number of features by transforming the existing features into a smaller, chosen number of new features. Unlike **Feature Selection**, it does not discard individual original features; every new feature is derived from them.
* Refer to my notebook to learn more about **Feature Selection Techniques**:

[Notebook Link](https://www.kaggle.com/srivignesh/feature-selection-techniques)

## Feature Extraction techniques
1. Principal Component Analysis (PCA)
2. Multi-Dimensional Scaling (MDS)
3. T-distributed Stochastic Neighbor Embedding (t-SNE)
4. Locally Linear Embedding (LLE)

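**Advantages:** Feature extraction techniques are commonly used for data visualization, and they can help reduce overfitting by lowering the number of input features.

**Disadvantage:** Feature extraction generally loses some information, because the reduced representation cannot retain everything contained in the original features.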
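
To make the information loss concrete, here is a minimal sketch that measures how much of the total variance survives a PCA reduction. The data is synthetic and the helper name `retained_variance` is my own, so treat both as assumptions for illustration rather than part of the house price pipeline.

```python
import numpy as np
from sklearn.decomposition import PCA

def retained_variance(x, n_components):
    """Fraction of the total variance kept by the first n_components principal components."""
    pca = PCA(n_components=n_components)
    pca.fit(x)
    return float(pca.explained_variance_ratio_.sum())

# Synthetic data (assumption): 200 samples with 5 independent features
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(200, 5))

kept_2 = retained_variance(x_demo, 2)  # reduce 5 features to 2
kept_5 = retained_variance(x_demo, 5)  # keep all 5 directions
print(f"2 components keep {kept_2:.2%} of the variance")
print(f"5 components keep {kept_5:.2%} of the variance")
```

Keeping every component retains all of the variance, while keeping only two drops part of it. That dropped part is the information loss mentioned above.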


In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.decomposition import PCA
from sklearn.manifold import MDS, TSNE

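# Use the full option names; the abbreviated patterns can match more than one pandas option
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 80)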

# Load the preprocessed data
The training data has already been preprocessed. The preprocessing steps were:

* MICE Imputation
* Log transformation
* Square root transformation
* Ordinal Encoding
* Target Encoding
* Z-Score Normalization

For a detailed implementation of the steps above, refer to my notebook on data preprocessing:

[Notebook Link](https://www.kaggle.com/srivignesh/data-preprocessing-for-house-price-prediction)

In [None]:
preprocessed_train = pd.read_csv('../input/preprocessed-train-data/preprocessed_train_data.csv')
# All columns except the last are features; the last column is the target
x_train, y_train = preprocessed_train.iloc[:, :-1], preprocessed_train.iloc[:, -1]
preprocessed_train.head()

# 1. Principal Component Analysis (PCA)

PCA is a **Linear Dimensionality Reduction** technique.
* PCA uses the eigenvectors of the covariance matrix as new axes and orthogonally projects the data points onto them.

**Eigenvector:** An eigenvector of a linear transformation is a non-zero vector that stays on the same line when that transformation is applied to it. It is only scaled, and the scale factor is its eigenvalue.
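
A quick numerical check of this definition, using a small symmetric matrix that I picked only for illustration (it is not taken from the dataset):

```python
import numpy as np

# Small symmetric matrix chosen for illustration (assumption)
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# np.linalg.eigh is meant for symmetric matrices and returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(A)

for i in range(len(eigenvalues)):
    v = eigenvectors[:, i]  # eigenvectors are stored as columns
    # Applying A only rescales v by its eigenvalue
    print(eigenvalues[i], np.allclose(A @ v, eigenvalues[i] * v))
```

For each eigenvector, `A @ v` matches `eigenvalue * v`, which is the "only scaled" property in the definition.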

## PCA Algorithm:
1. Compute the covariance matrix of the (mean-centered) data.
2. Find the eigenvectors and eigenvalues of the covariance matrix.
3. Rank the eigenvectors by eigenvalue. The eigenvector with the largest eigenvalue points in the direction along which the data has the highest variance, so it explains the most information.
4. Project the data points onto the top eigenvectors.
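
The steps above can be sketched directly in NumPy. This is a simplified illustration on synthetic data, and the function name `pca_from_scratch` is hypothetical. `sklearn.decomposition.PCA` computes the decomposition differently (via SVD), so the signs of individual components may differ from it.

```python
import numpy as np

def pca_from_scratch(x, n_components):
    """Project x onto the n_components eigenvectors with the largest eigenvalues."""
    # Step 1: center the data and compute the covariance matrix (features in columns)
    x_centered = x - x.mean(axis=0)
    cov = np.cov(x_centered, rowvar=False)
    # Step 2: eigen-decomposition (eigh is for symmetric matrices, ascending eigenvalues)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # Step 3: pick the eigenvectors with the largest eigenvalues
    order = np.argsort(eigenvalues)[::-1][:n_components]
    components = eigenvectors[:, order]
    # Step 4: project the centered data onto those eigenvectors
    projected = x_centered @ components
    return projected, eigenvalues[order]

# Synthetic data (assumption): 100 samples, 4 features
rng = np.random.default_rng(42)
x_demo = rng.normal(size=(100, 4))
projected, top_eigenvalues = pca_from_scratch(x_demo, 2)
print(projected.shape)
print(top_eigenvalues)
```

The variance of each projected column equals the corresponding eigenvalue, which is why a larger eigenvalue means more variance explained.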


In [None]:
def PCA_FE(x_train, y_train, n):
    """PCA - Feature Extraction"""
    pca = PCA(n_components=n)
    # Name the output columns pc1, pc2, ..., pcn
    column_list = ['pc' + str(i) for i in range(1, n + 1)]
    # PCA is unsupervised: fit_transform accepts y_train but ignores it
    x_train_pca = pd.DataFrame(pca.fit_transform(x_train, y_train), columns=column_list, index=x_train.index)
    return x_train_pca, pca

# PCA with two dimensions
x_train_pca, pca = PCA_FE(x_train, y_train, 2)
display(x_train_pca.head())

# PCA with three dimensions (this overwrites the two-dimensional `pca` object)
x_train_pca_3, pca = PCA_FE(x_train, y_train, 3)
display(x_train_pca_3.head())

In [None]:
def plot_pca(x_train_pca):
    '''2D data visualization for PCA'''
    plt.figure(1, figsize=(10, 10))
    plt.title('Scatter Plot for PCA', fontsize=15)
    plt.xlabel('PC1', fontsize=12)
    plt.ylabel('PC2', fontsize=12)
    plt.scatter(x_train_pca['pc1'], x_train_pca['pc2'])
    plt.show()

plot_pca(x_train_pca)

In [None]:
'''3D data visualization for PCA'''
px.scatter_3d(x_train_pca_3, x = 'pc1', y ='pc2', z= 'pc3')

# 2. Multi-Dimensional Scaling (MDS)

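MDS is related to PCA, but it works from pairwise dissimilarities between data points rather than from the covariance of the features.
* Classical MDS finds the eigenvectors of the double-centered matrix of squared dissimilarities, as opposed to the covariance matrix in PCA. With Euclidean distances, classical MDS gives the same embedding as PCA (up to sign flips of the axes), so in that case it is a **Linear Dimensionality Reduction** technique as well.
* Note that `sklearn.manifold.MDS`, used in the code below, does not perform this eigen-decomposition by default. It minimizes a stress function iteratively with the SMACOF algorithm, so its result depends on the random initialization.

## MDS algorithm (classical MDS):
1. Compute the dissimilarity matrix (pairwise distances between data points).
2. Square the dissimilarities, double-center the resulting matrix, and find its eigenvectors and eigenvalues.
3. Keep the eigenvectors with the largest eigenvalues, since they capture the most variance in the data.
4. Scale each kept eigenvector by the square root of its eigenvalue to obtain the low-dimensional coordinates of the data points.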
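
As a rough sketch of the eigenvector-based procedure, here is classical (Torgerson) MDS in NumPy on synthetic data. The function name `classical_mds` is hypothetical. The sketch assumes Euclidean distances and double-centers the squared distance matrix before the eigen-decomposition. As far as I know, `sklearn.manifold.MDS` uses the iterative SMACOF algorithm instead, so its output will not match this sketch.

```python
import numpy as np

def classical_mds(x, n_components):
    """Classical MDS: embed the points using eigenvectors of the double-centered squared distances."""
    # Step 1: squared Euclidean dissimilarity matrix
    diff = x[:, None, :] - x[None, :, :]
    d_squared = (diff ** 2).sum(axis=-1)
    n = d_squared.shape[0]
    # Double centering turns squared distances into an inner-product (Gram) matrix
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ d_squared @ centering
    # Step 2: eigen-decomposition of the symmetric Gram matrix
    eigenvalues, eigenvectors = np.linalg.eigh(gram)
    # Step 3: keep the largest eigenvalues (clip tiny negatives caused by rounding)
    order = np.argsort(eigenvalues)[::-1][:n_components]
    top_values = np.clip(eigenvalues[order], 0.0, None)
    # Step 4: coordinates are the eigenvectors scaled by the square root of the eigenvalues
    return eigenvectors[:, order] * np.sqrt(top_values)

# Synthetic data (assumption): 50 samples, 4 features
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(50, 4))
embedding = classical_mds(x_demo, 2)
print(embedding.shape)
```

If all four dimensions are kept, the pairwise distances of the embedding match those of the original points. Keeping fewer dimensions approximates them.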

In [None]:
def MDS_FE(x_train, y_train, n):
    """MDS - Feature Extraction"""
    mds = MDS(n_components=n)
    # Name the output columns component1, component2, ..., componentn
    column_list = ['component' + str(i) for i in range(1, n + 1)]
    # MDS is unsupervised: fit_transform accepts y_train but ignores it
    x_train_mds = pd.DataFrame(mds.fit_transform(x_train, y_train), columns=column_list, index=x_train.index)
    return x_train_mds, mds

# MDS with two dimensions
x_train_mds, mds = MDS_FE(x_train, y_train, 2)
display(x_train_mds.head())

# MDS with three dimensions (this overwrites the two-dimensional `mds` object)
x_train_mds_3, mds = MDS_FE(x_train, y_train, 3)
display(x_train_mds_3.head())

In [None]:
def plot_mds(x_train_mds):
    '''2D data visualization for MDS'''
    plt.figure(1, figsize=(10, 10))
    plt.title('Scatter Plot for MDS', fontsize=15)
    plt.xlabel('Component1', fontsize=12)
    plt.ylabel('Component2', fontsize=12)
    plt.scatter(x_train_mds['component1'], x_train_mds['component2'])
    plt.show()

plot_mds(x_train_mds)

In [None]:
'''3D data visualization for MDS'''
px.scatter_3d( x_train_mds_3, x='component1', y ='component2', z='component3')

# 3. T-distributed Stochastic Neighbor Embedding (t-SNE)
t-SNE is a non-linear dimensionality reduction algorithm, so it can capture non-linear relationships in the data. It is mainly used as a tool to visualize data.

t-SNE differs from PCA in that it preserves only small pairwise distances (local similarities), whereas PCA is concerned with preserving large pairwise distances in order to maximize variance.

* First, t-SNE constructs a probability distribution over pairs of high-dimensional objects in such a way that similar objects are assigned a higher probability while dissimilar objects are assigned a very low probability.
* Second, t-SNE defines a similar probability distribution over the points in the low-dimensional map, and it minimizes the Kullback–Leibler divergence (KL divergence) between the two distributions with respect to the locations of the points in the map. While the original algorithm uses the Euclidean distance between objects as the base of its similarity metric, this can be changed as appropriate.
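
The two distributions can be sketched as follows. This is a simplified illustration, not the implementation inside `sklearn.manifold.TSNE`: it uses one fixed Gaussian bandwidth `sigma` for every point (an assumption), whereas real t-SNE tunes a bandwidth per point from the perplexity. The function names and the synthetic data are hypothetical as well.

```python
import numpy as np

def pairwise_squared_distances(points):
    """Squared Euclidean distance between every pair of rows."""
    diff = points[:, None, :] - points[None, :, :]
    return (diff ** 2).sum(axis=-1)

def high_dim_affinities(x, sigma=1.0):
    """Symmetric joint probabilities P from a Gaussian kernel (fixed sigma is a simplification)."""
    kernel = np.exp(-pairwise_squared_distances(x) / (2.0 * sigma ** 2))
    np.fill_diagonal(kernel, 0.0)  # a point is not its own neighbor
    conditional = kernel / kernel.sum(axis=1, keepdims=True)  # p(j | i), each row sums to 1
    n = x.shape[0]
    return (conditional + conditional.T) / (2.0 * n)  # symmetrize so that P sums to 1

def low_dim_affinities(y):
    """Joint probabilities Q in the map, using a Student-t kernel with one degree of freedom."""
    kernel = 1.0 / (1.0 + pairwise_squared_distances(y))
    np.fill_diagonal(kernel, 0.0)
    return kernel / kernel.sum()

# Synthetic data (assumption): 6 points in 5 dimensions and a random 2D map of them
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(6, 5))
y_demo = rng.normal(size=(6, 2))
p = high_dim_affinities(x_demo)
q = low_dim_affinities(y_demo)
print(p.sum(), q.sum())
```

t-SNE then moves the points of the low-dimensional map so that `q` becomes as close to `p` as possible in terms of KL divergence.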

KL divergence, also called relative entropy, measures how much one probability distribution differs from another.
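
A small sketch of the KL divergence for discrete distributions, with two toy distributions chosen only for illustration (the helper name `kl_divergence` is my own):

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions, in nats. Assumes q > 0 wherever p > 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p_i = 0 contribute nothing by convention
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Toy distributions (assumption)
p = np.array([0.5, 0.5])
q = np.array([0.9, 0.1])

print(kl_divergence(p, p))  # identical distributions give 0.0
print(kl_divergence(p, q))
print(kl_divergence(q, p))  # differs from the line above: KL divergence is not symmetric
```

The divergence is zero only when the two distributions are identical, and it is not symmetric, so it is not a true distance.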

**Reference:** https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding

In [None]:
def TSNE_FE(x_train, y_train, n):
    """t-SNE - Feature Extraction"""
    tsne = TSNE(n_components=n)
    # Name the output columns component1, component2, ..., componentn
    column_list = ['component' + str(i) for i in range(1, n + 1)]
    # t-SNE is unsupervised: fit_transform accepts y_train but ignores it
    x_train_tsne = pd.DataFrame(tsne.fit_transform(x_train, y_train), columns=column_list, index=x_train.index)
    return x_train_tsne, tsne

# t-SNE with two dimensions
x_train_tsne, tsne = TSNE_FE(x_train, y_train, 2)
display(x_train_tsne.head())

# t-SNE with three dimensions (this overwrites the two-dimensional `tsne` object)
x_train_tsne_3, tsne = TSNE_FE(x_train, y_train, 3)
display(x_train_tsne_3.head())

In [None]:
def plot_tsne(x_train_tsne):
    '''2D data visualization for t-SNE'''
    plt.figure(1, figsize=(10, 10))
    plt.title('Scatter Plot for TSNE', fontsize=15)
    plt.xlabel('Component1', fontsize=12)
    plt.ylabel('Component2', fontsize=12)
    plt.scatter(x_train_tsne['component1'], x_train_tsne['component2'])
    plt.show()

# Call plot_tsne (not plot_mds) so the figure carries the t-SNE title
plot_tsne(x_train_tsne)

In [None]:
'''3D data visualization for TSNE'''
px.scatter_3d( x_train_tsne_3, x='component1', y ='component2', z='component3')